Wraps the Python cat_stack.extract() function. Discovers and returns a
normalised, deduplicated set of categories found in the input data.
Usage
extract(
  input_data,
  api_key,
  input_type = "text",
  description = "",
  max_categories = 12L,
  categories_per_chunk = 10L,
  divisions = 12L,
  user_model = "gpt-4o",
  creativity = NULL,
  specificity = "broad",
  research_question = NULL,
  mode = "text",
  filename = NULL,
  model_source = "auto",
  iterations = 8L,
  random_state = NULL,
  focus = NULL,
  chunk_delay = 0,
  auto_start_ollama = TRUE
)
Arguments
- input_data
A character vector, list, or data.frame column. For images/PDFs, a directory path or character vector of file paths.
- api_key
Character. API key for the model provider.
- input_type
Character. Type of input: "text" (default), "image", or "pdf".
- description
Character. The survey question or data description. Default "".
- max_categories
Integer. Maximum number of final categories to return. Default 12L.
- categories_per_chunk
Integer. Categories to extract per data chunk. Default 10L.
- divisions
Integer. Number of chunks to divide the data into. Default 12L.
- user_model
Character. Model name. Default "gpt-4o".
- creativity
Numeric or NULL. Temperature setting. NULL uses the provider default. Default NULL.
- specificity
Character. Category granularity: "broad" (default) or "specific".
- research_question
Character or NULL. Optional research context.
- mode
Character. Processing mode. For PDFs: "text" (default), "image", or "both". For images: "image" (default) or "both".
- filename
Character or NULL. Optional CSV filename to save results.
- model_source
Character. Provider hint: "auto", "openai", "anthropic", "google", etc. Default "auto".
- iterations
Integer. Number of passes over the data. Default 8L.
- random_state
Integer or NULL. Random seed for reproducibility.
- focus
Character or NULL. Optional focus for extraction (e.g., "decisions to move").
- chunk_delay
Numeric. Seconds between API calls (rate limiting). Default 0.0.
- auto_start_ollama
Logical. If TRUE (default), automatically call ensure_ollama_running() when model_source = "ollama". Set FALSE to skip the check (e.g. on CI).
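As a rough sketch of how these arguments combine, the snippet below assembles a call for a folder of PDFs. The directory "reports/", the description text, and the chosen seed and delay are hypothetical values picked for illustration; the argument names and allowed values are taken from the Usage signature above. The call itself is left commented out because it needs a valid API key and the package that provides extract().

```r
# Arguments for a PDF run; names match the Usage signature above.
pdf_args <- list(
  input_data   = "reports/",                   # hypothetical directory of PDFs
  api_key      = Sys.getenv("OPENAI_API_KEY"),
  input_type   = "pdf",
  mode         = "both",                       # PDF modes: "text", "image", "both"
  description  = "Annual reports",             # hypothetical data description
  specificity  = "specific",
  random_state = 42L,                          # fixed seed for reproducibility
  chunk_delay  = 1                             # wait 1 second between API calls
)

# Uncomment once the package providing extract() is attached and the key is set:
# result <- do.call(extract, pdf_args)
```

Keeping the arguments in a list makes it easy to reuse the same settings across runs and to change a single value, such as mode, between them.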
Value
A named list with:
- counts_df
A data.frame of discovered categories with counts.
- top_categories
A character vector of the top category names.
- raw_top_text
The raw model output from the final merge step.
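To show how the returned list might be handled, here is a minimal sketch using a hand-built stand-in for the result. The values are mock data, and the column names category and count in counts_df are assumptions, since the page does not document the columns; check names(result$counts_df) on real output before relying on them.

```r
# Mock return value for illustration only; real output comes from extract().
# The column names "category" and "count" are assumptions, not documented.
result <- list(
  counts_df = data.frame(
    category = c("Job opportunity", "Family", "Cost of living"),
    count    = c(14L, 9L, 5L),
    stringsAsFactors = FALSE
  ),
  top_categories = c("Job opportunity", "Family", "Cost of living"),
  raw_top_text   = "1. Job opportunity\n2. Family\n3. Cost of living"
)

# Sort by count (descending) and add each category's share of the total.
counts <- result$counts_df[order(-result$counts_df$count), ]
counts$share <- round(counts$count / sum(counts$count), 2)
print(counts)

# Keep only rows whose category also appears among the top categories.
top_only <- counts[counts$category %in% result$top_categories, ]
```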
Examples
if (FALSE) { # \dontrun{
  # df is assumed to be a data.frame with a character column `responses`.
  result <- extract(
    input_data = df$responses,
    description = "Why did you move to this city?",
    api_key = Sys.getenv("OPENAI_API_KEY")
  )
  print(result$top_categories)
  print(result$counts_df)
} # }
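Because Sys.getenv() returns an empty string when a variable is unset, the example above would pass an empty key to the provider if OPENAI_API_KEY is missing. One possible guard is sketched below; get_api_key() is a hypothetical helper written for this illustration and is not part of the package.

```r
# Hypothetical helper: read an API key from the environment and fail early
# with a clear message when it is unset (Sys.getenv() returns "" in that case).
get_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var)
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set.", call. = FALSE)
  }
  key
}

# Intended use (commented out because it needs a real key and the package):
# result <- extract(
#   input_data = df$responses,
#   api_key = get_api_key()
# )
```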