Classifying Open-Ended Survey Responses
Source: vignettes/classifying-survey-responses.Rmd
What cat.survey adds
cat.survey is a thin domain wrapper around
cat.stack that injects survey-question
context into every prompt. When you call
classify(input_data = ..., survey_question = "Why did you move?"),
the LLM sees:
“A respondent was asked: Why did you move? Their answer was: …”
That framing measurably improves accuracy on open-ended survey data versus generic text classification, because the model uses the question to disambiguate short or context-dependent responses.
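The framing quoted above can be illustrated with a small sketch. The helper name frame_response() is hypothetical and is not part of cat.survey; the package's real prompt template may contain more than this. It only shows how a question and an answer combine into the sentence the model sees.

```r
# Hypothetical helper, for illustration only: wraps each response in the
# question framing quoted above. Not cat.survey's actual internals.
frame_response <- function(survey_question, response) {
  sprintf(
    "A respondent was asked: %s Their answer was: %s",
    survey_question,
    response
  )
}

# sprintf() is vectorised, so one question is paired with every response.
framed <- frame_response(
  "Why did you move?",
  c("Took a new job in Chicago", "Family")
)
print(framed)
```

A terse answer such as "Family" says little on its own; paired with the question, it reads as a reason for moving.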
Everything else — supported models, output format, ensemble voting,
batch mode — is identical to cat.stack.
Install
install.packages(
"cat.survey",
repos = c("https://chrissoria.r-universe.dev",
"https://cloud.r-project.org")
)
library(cat.survey)
Quick classification
responses <- c(
"Took a new job in Chicago",
"Wanted to be closer to grandkids",
"Couldn't afford rent in the Bay Area",
"Job market collapsed after the layoffs",
"Family pressure to move home"
)
# Verbose category descriptions classify better than short labels.
verbose_cats <- c(
"Job/school: A change in employment, education, or career, including transfers and retirement.",
"Family: Relationship changes, having children, supporting relatives, or relocating to be near family.",
"Cost of living: Housing affordability, cost of goods, or general economic pressure.",
"Other: The response does not fit any of the above categories."
)
results <- classify(
input_data = responses,
categories = verbose_cats,
survey_question = "Why did you move to a new city?",
api_key = Sys.getenv("OPENAI_API_KEY"),
user_model = "gpt-4o-mini"
)
Multi-label survey responses
Many survey responses fit more than one category (“I moved for a new
job and to be closer to family”). The default classifier is multi-label
— results will have one 0/1 column per category, and a row
can have multiple 1s.
To force single-label, set add_other = FALSE and shrink
the category list. To make multi-label explicit in your analysis, use
the binary columns directly:
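As a minimal sketch of working with the binary columns, the example below uses a mock data frame in place of real classify() output. The column names (job_school, family, cost_of_living, other) are assumptions made for illustration; inspect names(results) to see what your installed version returns.

```r
# Mock classifier output. Column names are hypothetical stand-ins for
# the one-0/1-column-per-category layout described above.
mock_results <- data.frame(
  response = c(
    "Took a new job in Chicago",
    "Wanted to be closer to grandkids",
    "I moved for a new job and to be closer to family"
  ),
  job_school     = c(1L, 0L, 1L),
  family         = c(0L, 1L, 1L),
  cost_of_living = c(0L, 0L, 0L),
  other          = c(0L, 0L, 0L),
  stringsAsFactors = FALSE
)

label_cols <- c("job_school", "family", "cost_of_living", "other")

# Share of responses assigned to each category. Shares can sum to more
# than 1 because one response may carry several labels.
prevalence <- colMeans(mock_results[label_cols])

# Number of categories assigned to each response.
n_labels <- rowSums(mock_results[label_cols])

# Responses that received more than one label.
multi_label <- mock_results[n_labels > 1, "response"]
```

Reporting per-category prevalence rather than a single "primary" code keeps the multi-label structure visible in the analysis.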
Discovering a category scheme when you don’t have one
If you don’t already have a coding scheme, use extract()
to discover one from the responses themselves, then pass the result to
classify():
cats <- extract(
input_data = responses,
survey_question = "Why did you move to a new city?",
max_categories = 8L,
api_key = Sys.getenv("OPENAI_API_KEY")
)
cats$top_categories
# Optionally rewrite the labels to be more verbose, then classify:
results <- classify(
input_data = responses,
categories = cats$top_categories,
survey_question = "Why did you move to a new city?",
api_key = Sys.getenv("OPENAI_API_KEY")
)
See also extract.Rmd in the
r-package/examples/ directory for a deeper walkthrough of
category discovery.
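The optional rewrite step mentioned in the code comment above could look like the sketch below. The vector discovered stands in for cats$top_categories, and both its contents and the descriptions are invented for illustration; real extract() output will differ.

```r
# Stand-in for cats$top_categories (hypothetical labels).
discovered <- c("Job", "Family", "Housing costs")

# Hand-written descriptions keyed by label. Labels without an entry are
# left unchanged.
descriptions <- c(
  "Job"    = "A change in employment, education, or career.",
  "Family" = "Relocating to be near family or to support relatives."
)

has_description <- discovered %in% names(descriptions)

# Build "Label: description" strings where a description exists.
verbose <- discovered
verbose[has_description] <- paste0(
  discovered[has_description], ": ",
  descriptions[discovered[has_description]]
)

print(verbose)
```

The resulting character vector can then be passed as the categories argument in place of the short labels.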
Recommendations for survey work
- Always set survey_question. It’s the whole point of using cat.survey over cat.stack. Without it you might as well use cat.stack::classify() directly.
- Write verbose category descriptions. A label like "Family: relocating to be near family, having a child, divorce..." classifies several percentage points more accurately than just "Family". This is the single biggest accuracy lever.
- Include an “Other” category. Prevents the model from forcing ambiguous responses into ill-fitting boxes. cat.survey will prompt to add one if you forget (add_other = "prompt" is the default).
- Validate on a hand-coded subsample. For published research, never trust classifications without spot-checking against human coding on at least 50–100 responses.
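For the validation step, one way to compare model output against human coding for a single category is percent agreement plus Cohen's kappa. The sketch below uses base R only. The two vectors are made-up 0/1 codes for ten responses, not real results, and cohen_kappa() is a helper defined here, not a cat.survey function.

```r
# Made-up 0/1 codes for one category on ten responses (illustration only).
human <- c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
model <- c(1, 0, 1, 0, 0, 0, 1, 1, 1, 0)

# Cohen's kappa: agreement corrected for the agreement expected by chance.
# Undefined when expected agreement is exactly 1 (e.g. both coders are
# constant and identical).
cohen_kappa <- function(a, b) {
  stopifnot(length(a) == length(b))
  lv <- sort(unique(c(a, b)))
  tab <- table(factor(a, levels = lv), factor(b, levels = lv))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2
  (p_obs - p_exp) / (1 - p_exp)
}

agreement <- mean(human == model)
kappa <- cohen_kappa(human, model)

print(c(agreement = agreement, kappa = kappa))
```

Kappa is worth reporting alongside raw agreement because a rare category can show high agreement simply from both coders assigning 0 most of the time.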
Where to learn more
- Full Getting Started guide: vignette("getting-started", package = "cat.llm")
- Per-function reference: ?cat.survey::classify, ?cat.survey::extract, ?cat.survey::explore
- Empirical best-practices research (incl. why verbose labels help) is in the project Python README.