Getting Started with CatLLM for R
Chris Soria
2026-05-18
Source: vignettes/getting-started.Rmd
Introduction
CatLLM is an ecosystem of R packages that use large language models (LLMs) to categorize open-ended text — survey responses, social media posts, academic papers, policy documents, web content — at scale. It’s designed for researchers who want quantitative analysis of free-text data without manual coding or hiring research assistants.
CatLLM reaches 98% accuracy relative to human consensus on classification tasks when using leading models such as GPT-5, Gemini 2.5 Pro, and Qwen 3. These results were validated against expert human coders across 21 LLMs and 4 surveys; see the SocArXiv preprint for methodology.
The R packages are thin reticulate wrappers
around the underlying Python implementation. Every parameter,
default, and behavior is identical to the Python version — only
the calling syntax differs. For deep conceptual content, advanced
configuration, or the full 50-parameter classify()
reference, the Python cat-llm README
is the canonical source.
Installation
Install the meta-package (brings in all 7 sub-packages) from R-universe:
install.packages(
  "cat.llm",
  repos = c("https://chrissoria.r-universe.dev",
            "https://cloud.r-project.org")
)
Or install a single domain package for a lighter footprint:
install.packages(
  c("cat.stack", "cat.survey"),
  repos = c("https://chrissoria.r-universe.dev",
            "https://cloud.r-project.org")
)
One-time setup: install the Python backend (requires Python 3.9+ on your system):
library(cat.llm)
install_cat_stack()
# With PDF processing support:
# install_cat_stack(pdf = TRUE)
Quick Start
CatLLM is designed for building datasets at scale,
not one-off queries. While you can classify individual responses, its
primary purpose is batch processing entire text columns, image
collections, or PDF corpora into structured research datasets. All
outputs are R data.frames ready for analysis or CSV
export.
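As a rough illustration of the "ready for analysis or CSV export" point, the sketch below summarizes and exports a mock result using base R only. The data frame is a hand-built stand-in, and its column names (`input_data`, `category_1`, and so on) are assumptions for illustration, not the package's documented output schema.

```r
# Mock stand-in for a classify() result. The column names are
# hypothetical, not the package's documented output schema.
results <- data.frame(
  input_data = c("I love this product!", "Terrible experience.", "It was fine."),
  category_1 = c(1L, 0L, 0L),  # e.g. "Positive"
  category_2 = c(0L, 1L, 0L),  # e.g. "Negative"
  category_3 = c(0L, 0L, 1L),  # e.g. "Neutral"
  stringsAsFactors = FALSE
)

# Share of rows assigned to each category
category_cols  <- grep("^category_", names(results), value = TRUE)
category_share <- colMeans(results[category_cols])

# Export for use outside R
out_path <- file.path(tempdir(), "classified_responses.csv")
write.csv(results, out_path, row.names = FALSE)
```

With a real result, replace the mock data frame with the object returned by `classify()` and adjust the column pattern to whatever names that object actually contains.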
Option A — via the meta-package
library(cat.llm) attaches every domain package and
exposes domain-suffixed aliases (classify_survey(),
classify_political(), classify_social(),
etc.):
library(cat.llm)
api_key <- Sys.getenv("OPENAI_API_KEY")
# Domain-neutral classification (from cat.stack)
results <- classify(
  input_data = c("I love this product!", "Terrible experience.", "It was fine."),
  categories = c("Positive", "Negative", "Neutral"),
  description = "Customer feedback sentiment",
  api_key = api_key
)
# Survey classification — adds survey-tuned prompts
results <- classify_survey(
  input_data = df$responses,
  categories = c("Job change", "Family reasons", "Cost of living"),
  survey_question = "Why did you move to a new city?",
  api_key = api_key
)
# Academic paper classification — fetches by journal
results <- classify_academic(
  input_data = NULL,
  categories = c("Empirical", "Theoretical", "Review"),
  journal_issn = "0894-4393",
  paper_limit = 50L,
  polite_email = "you@university.edu",
  api_key = api_key
)
# Social media classification
results <- classify_social(
  input_data = df$posts,
  categories = c("Misinformation", "Opinion", "News"),
  api_key = api_key
)
# Political text classification (built-in registered sources)
results <- classify_political(
  source = "city_san_diego",
  doc_type = "ordinance",
  since = "2025-01-01",
  n = 50L,
  categories = c("Housing", "Public Safety", "Finance"),
  api_key = api_key
)
# Cognitive assessment scoring (CERAD drawings)
scores <- cerad_drawn_score(
  shape = "diamond",
  image_input = df$drawing_paths,
  api_key = api_key
)
Option B — install only the domain you need
For a lighter dependency footprint, install only the package you actually use:
# install.packages("cat.survey", repos = ...)
library(cat.survey)
results <- cat.survey::classify(
  input_data = df$responses,
  categories = c("Job change", "Family reasons", "Cost of living"),
  survey_question = "Why did you move to a new city?",
  api_key = Sys.getenv("OPENAI_API_KEY")
)
The two options produce identical results: classify_survey() from cat.llm is just a thin re-export of cat.survey::classify().
The Ecosystem
| Package | Domain | Wraps |
|---|---|---|
| cat.stack | General-purpose classification base | classify, extract, explore, summarize |
| cat.survey | Open-ended survey responses | Adds survey_question= framing |
| cat.vader | Social media posts | Platform connectors (Threads, Reddit, Bluesky, etc.) |
| cat.ademic | Academic papers | OpenAlex-based journal/topic fetching, PDF support |
| cat.cog | Cognitive assessment scoring | cerad_drawn_score() for CERAD constructional praxis |
| cat.pol | Policy documents | 17 registered sources (ordinances, federal laws, EOs, political speech) |
| cat.web | Web content | Automatic URL fetching, web-context prompt injection |
| cat.llm | Meta-package (installs all 7) | Re-exports + domain-suffixed aliases |
Every domain package shares the same core API —
classify(), extract(), explore(),
summarize() (where applicable) — and depends on
cat.stack, which holds the underlying classification
engine.
Best Practices for Classification
These recommendations are based on empirical testing across 4 surveys, 4 models (7B to frontier-class), and 250-row subsamples compared against human-coded ground truth. They apply identically to R and Python.
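A comparison against human-coded ground truth can be sketched in base R as follows. This is a simplified single-label illustration with made-up labels; it is an assumption about how one might check agreement on a subsample, not the evaluation procedure used in the preprint.

```r
# Hypothetical human and model labels for the same five responses
human <- c("Job/school", "Family", "Other", "Family", "Cost of living")
model <- c("Job/school", "Family", "Family", "Family", "Cost of living")

# Overall agreement with the human-coded labels
accuracy <- mean(human == model)

# Cross-tabulate to see where the disagreements fall
lv <- sort(unique(c(human, model)))
confusion <- table(
  human = factor(human, levels = lv),
  model = factor(model, levels = lv)
)
```

Running the same check before and after a prompt change (for example, switching to verbose category descriptions) gives the percentage-point differences discussed in this section.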
What works
Detailed category descriptions — the single biggest
lever for accuracy. Instead of short labels like
"Job change", use verbose descriptions like
"The person had a job or school or career change, including transferred and retired."
This consistently improves accuracy by several percentage points across all
models.
verbose_categories <- c(
  "Job/school: A change in employment, education, or career, including transfers and retirement.",
  "Family: Relationship changes, having children, supporting relatives, or relocating to be near family.",
  "Cost of living: Housing affordability, cost of goods, or general economic pressure.",
  "Other: The response does not fit any of the above categories."
)
results <- classify(
  input_data = df$responses,
  categories = verbose_categories,
  api_key = Sys.getenv("OPENAI_API_KEY")
)
Include an “Other” category — a catch-all like
"Other: The response does not fit any of the above categories."
prevents the model from forcing ambiguous responses into ill-fitting
categories. By default, R wrappers will prompt to add one if your
category list lacks one (add_other = "prompt").
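For non-interactive scripts, where an interactive prompt may be inconvenient, one option is to guarantee the catch-all yourself before calling `classify()`. The helper below is hypothetical (`ensure_other()` is not part of the CatLLM API) and only approximates the idea by checking for a category that starts with "Other".

```r
# Hypothetical helper, not part of the CatLLM API: append a catch-all
# category unless one that starts with "Other" is already present.
ensure_other <- function(
    categories,
    other = "Other: The response does not fit any of the above categories.") {
  has_other <- any(grepl("^other\\b", categories, ignore.case = TRUE, perl = TRUE))
  if (has_other) categories else c(categories, other)
}

cats <- ensure_other(c("Job change", "Family reasons", "Cost of living"))
```

Calling `ensure_other()` on a vector that already has an "Other" entry returns it unchanged, so the helper is safe to apply unconditionally.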
Few-shot examples
(example1–example6) — providing 2-4 labeled
examples can help, especially for weaker models. Effects are modest
(+0–1 pp on average) and model-dependent.
results <- classify(
  input_data = df$responses,
  categories = verbose_categories,
  example1 = list(text = "Took a new job in Chicago", label = "Job/school"),
  example2 = list(text = "Wanted to be closer to grandkids", label = "Family"),
  api_key = Sys.getenv("OPENAI_API_KEY")
)
Low temperature (creativity = 0) — for
classification, deterministic output is preferable. Higher temperatures
add noise without improving accuracy.
What doesn’t help (or hurts)
- Chain of Thought (chain_of_thought = TRUE): no measurable improvement in our testing; slightly degraded performance for some models. Off by default.
- Chain of Verification (chain_of_verification = TRUE): uses ~4x the API calls for self-verification. Consistently reduced accuracy by 1–2 pp by retracting correct classifications. Not recommended for classification.
- Step-back prompting (step_back_prompt = TRUE): inconsistent, with slight gains for weaker models (~+1.8 pp) and slight losses for stronger ones (~−0.5 pp). Not recommended as a default.
- Context prompting (context_prompt = TRUE): no consistent benefit observed.
Configuration
Get an API key
Get an API key from your preferred provider:
| Provider | Where |
|---|---|
| OpenAI | platform.openai.com |
| Anthropic | console.anthropic.com |
| Google | aistudio.google.com |
| HuggingFace | huggingface.co/settings/tokens |
| xAI | console.x.ai |
| Mistral | console.mistral.ai |
| Perplexity | perplexity.ai/settings/api |
Most providers require adding a payment method. Store your key securely and never share it publicly.
Store your key in .Renviron (recommended)
Rather than pasting your key into scripts, store it in
~/.Renviron so it’s automatically available to every R
session:
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
Then restart R and use:
api_key <- Sys.getenv("OPENAI_API_KEY")
To find or edit your .Renviron:
usethis::edit_r_environ()  # opens it for editing; creates if missing
After saving, restart R for the changes to take effect.
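`Sys.getenv()` returns an empty string when a variable is unset, so a missing key can go unnoticed until the API call fails. A small guard such as the one below fails early instead. `require_key()` is a hypothetical helper, not part of the CatLLM API, and the demo variable name is made up for illustration.

```r
# Hypothetical helper, not part of the CatLLM API: fetch an environment
# variable and stop with a clear message if it is missing or empty.
require_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop(sprintf("%s is not set. Add it to ~/.Renviron and restart R.", var),
         call. = FALSE)
  }
  key
}

# Demonstration with a throwaway variable (made-up name and value)
Sys.setenv(CATLLM_DEMO_KEY = "sk-demo")
demo_key <- require_key("CATLLM_DEMO_KEY")
```

In a real script the call would be `api_key <- require_key("OPENAI_API_KEY")`, with the result passed to `classify()` as usual.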
Run entirely locally with no API key
For sensitive data, free use, or air-gapped environments, run against a local model via Ollama:
# In a separate terminal: install Ollama, then pull a model.
# Recommended (larger, more accurate, ~9 GB):
# ollama pull qwen2.5:14b
# Smaller fallback if disk/RAM constrained (~4.7 GB):
# ollama pull qwen2.5:7b
results <- classify(
  input_data = df$responses,
  categories = c("Positive", "Negative", "Neutral"),
  user_model = "qwen2.5:14b",  # or "qwen2.5:7b" if you pulled the smaller one
  model_source = "ollama"
)
⚠️ Disk-space heads-up: qwen2.5:14b is ~9 GB on disk and Ollama needs roughly that much free during the download. Check df -h / first; if under ~12 GB free, use qwen2.5:7b.
No API key needed; your data never leaves the machine.
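The disk-space rule of thumb above can be written down as a small helper, together with a check that the `ollama` binary is on the PATH. Both are illustrative sketches: `pick_qwen_model()` is a hypothetical function that simply encodes the ~12 GB threshold, and the free-space figure has to be supplied by you (for example, read off `df -h /`).

```r
# Hypothetical helper: choose a model tag from the free disk space (in GB),
# following the ~12 GB rule of thumb described above.
pick_qwen_model <- function(free_gb) {
  if (free_gb < 12) "qwen2.5:7b" else "qwen2.5:14b"
}

# Sys.which() returns "" when the executable is not found on the PATH
ollama_installed <- unname(nzchar(Sys.which("ollama")))

model_tag <- pick_qwen_model(free_gb = 8)
```

The resulting `model_tag` can then be passed as `user_model` alongside `model_source = "ollama"`.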
Supported Models
Specify any of these via user_model = "...":
- OpenAI: gpt-4o, gpt-4o-mini, gpt-4, gpt-5, …
- Anthropic: claude-sonnet-4-20250514, claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022, …
- Google: gemini-2.5-flash, gemini-2.5-pro, …
- HuggingFace: Qwen/Qwen3-235B, meta-llama/Llama-4-Scout, deepseek-ai/DeepSeek-V3, and thousands of community models
- xAI: grok-2, …
- Mistral: mistral-large-latest, pixtral-large-latest, …
- Perplexity: sonar-large, sonar-small, …
- Ollama (local): qwen2.5:14b (recommended, ~9 GB), qwen2.5:7b (smaller fallback, ~4.7 GB), llama3.1:8b, … (set model_source = "ollama")
Fully tested: OpenAI, Anthropic, Perplexity, Google Gemini (free tier has 5 RPM limit), HuggingFace, xAI, Mistral.
For best results when starting out, OpenAI
(gpt-4o-mini) or Anthropic
(claude-3-5-haiku-20241022) are cheap, fast, and
reliable.
Ensemble & multi-model classification
Run the same input through multiple models and combine results via majority voting. This often improves accuracy by reducing individual model biases.
results <- classify(
  input_data = df$responses,
  categories = verbose_categories,
  models = list(
    c("gpt-4o-mini", "openai", Sys.getenv("OPENAI_API_KEY")),
    c("claude-3-5-haiku-20241022", "anthropic", Sys.getenv("ANTHROPIC_API_KEY")),
    c("gemini-2.5-flash", "google", Sys.getenv("GOOGLE_API_KEY"))
  ),
  consensus_threshold = "unanimous"  # or 0.5 for majority, etc.
)
The output data.frame includes per-model predictions
(e.g., category_1_gpt_4o_mini,
category_1_claude) plus a consensus column.
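To make the voting idea concrete, here is a minimal base-R sketch of how a consensus column could be derived from per-model 0/1 predictions. The column names, the `consensus()` helper, and the strict "more than the threshold" rule are all assumptions for illustration; the package's own consensus logic may differ in detail.

```r
# Hypothetical per-model 0/1 predictions for one category
preds <- data.frame(
  category_1_gpt_4o_mini = c(1L, 1L, 0L, 1L),
  category_1_claude      = c(1L, 0L, 0L, 1L),
  category_1_gemini      = c(1L, 1L, 0L, 0L)
)

# Share of models voting "yes" on each row
agreement <- rowMeans(preds)

# Hypothetical helper: `threshold` is either a number or "unanimous".
# Assumes a strict ">" comparison for numeric thresholds.
consensus <- function(agreement, threshold = 0.5) {
  if (identical(threshold, "unanimous")) {
    as.integer(agreement == 1)
  } else {
    as.integer(agreement > threshold)
  }
}

majority  <- consensus(agreement, 0.5)
unanimous <- consensus(agreement, "unanimous")
```

In this toy data the majority rule accepts rows 1, 2, and 4, while the unanimous rule accepts only row 1, which shows how a stricter threshold trades coverage for agreement.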
API Reference (brief)
Every parameter from the Python classify(),
extract(), explore(), and
summarize() functions is exposed in R with identical
semantics. The full per-parameter documentation lives in the in-R help
system and on the R-universe per-package reference manuals.
| Function | In-R help | Online |
|---|---|---|
| Domain-neutral classify() | ?cat.stack::classify | cat.stack manual |
| Survey classify() | ?cat.survey::classify | cat.survey manual |
| Academic classify() | ?cat.ademic::classify | cat.ademic manual |
| Political classify() | ?cat.pol::classify | cat.pol manual |
| Web classify() | ?cat.web::classify | cat.web manual |
| Social classify() | ?cat.vader::classify | cat.vader manual |
| CERAD scoring | ?cat.cog::cerad_drawn_score | cat.cog manual |
| List registered policy sources | ?cat.pol::list_sources | cat.pol manual |
For full conceptual coverage of every parameter — batch mode, prompt tuning, embeddings, JSON formatting, advanced ensemble configurations — see the Python README API Reference. The R wrappers expose every Python kwarg.
R ↔︎ Python type translation
When adapting Python examples from the project README, the table
below covers the syntax differences. All conversions are handled
automatically by reticulate::r_to_py() inside the R
wrappers — you write R, the wrapper passes Python.
| Python | R |
|---|---|
| ["a", "b", "c"] | c("a", "b", "c") |
| {"key": "value"} | list(key = "value") |
| True / False / None | TRUE / FALSE / NULL |
| [(model, provider, key), (...)] (ensemble) | list(c(model, provider, key), c(...)) |
| df['col'] | df$col |
| import catllm | library(cat.llm) |
| catllm.classify_survey(...) | classify_survey(...) (after library(cat.llm)) |
Where to go from here
- Full conceptual reference: the Python cat-llm README, which covers every parameter, advanced configuration, prompt tuning, embeddings, etc. Since R is a thin reticulate layer, every Python concept applies directly.
- Per-package R reference manuals: https://chrissoria.r-universe.dev (pick a package, then click the “Reference Manual” link for full @param docs).
- End-to-end smoke test: see r-package/test-all-packages.R in the GitHub repo, a single R script that installs all 8 packages and runs a minimal classification per package.
- Issues, questions, contributions: github.com/chrissoria/cat-llm/issues
- Citation: if you use CatLLM in published research, please cite
Soria, C. (2026). Scaling Open-Ended Survey Coding: An LLM Pipeline Where Definitions Do the Heavy Lifting. SocArXiv. https://osf.io/preprints/socarxiv/gjvcf_v1
and the software DOI:
Soria, C. (2026). CatLLM: A Reproducible Python Ecosystem for Generating, Assigning, and Scoring Open-Ended Text, Images, and Documents Across Research Domains (v3.0.0) [Software]. Zenodo. https://doi.org/10.5281/zenodo.19960067