Scaling Open-Ended Survey Coding: Definitions, Ensembles, and the Limits of Prompt Engineering

As text classification with large language models (LLMs) becomes routine in the social sciences, researchers confront dozens of competing models, inconsistent advice on prompting, and little standardized tooling with evidence-based defaults. CatLLM, an open-source Python and R package, addresses this gap with a three-stage pipeline (exploration, extraction, and classification) for coding open-ended survey responses. The package supports multi-model ensembles, batch processing, and fully local deployment via open-weight models, so researchers working with sensitive data can avoid transmitting responses to external servers.

Its defaults are calibrated against a systematic empirical study evaluating 21 LLMs across three capability tiers, six providers, and four survey questions, benchmarked against sociologist-coded ground truth. This validation reveals a consistent problem: all models over-classify, with precision lagging sensitivity by 40–50 percentage points, which implies that default LLM configurations may substantially overstate category prevalence. CatLLM encodes empirically grounded mitigations as defaults, including verbose category definitions with explicit inclusion and exclusion criteria, unanimous multi-model ensembling, and an automatic “Other” escape-valve category, and it disables advanced prompting strategies that show no reliable benefit. Notably, ensembles of inexpensive open-weight models outperform the best individual cloud model. An applied example uses CatLLM to code open-ended responses from the UC Berkeley Social Networks Study, in which younger and older adults explain why they can or cannot expect help from friends and family in an emergency, revealing distinct life-course patterns in perceived social support.
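
To make the over-classification problem and the unanimous-ensemble mitigation concrete, the following minimal Python sketch shows how a unanimity rule can trade a little sensitivity for higher precision. It does not use the CatLLM API: the function names `unanimous_vote` and `precision_and_sensitivity` are hypothetical, and the labels are invented toy data for a single binary category, not results from the study.

```python
from typing import Sequence


def unanimous_vote(votes_per_model: Sequence[Sequence[int]]) -> list[int]:
    """Assign the category to a response only if every model assigned it.

    votes_per_model holds one list of 0/1 labels per model, all aligned
    to the same responses. (Hypothetical helper, not part of CatLLM.)
    """
    return [int(all(votes)) for votes in zip(*votes_per_model)]


def precision_and_sensitivity(
    predicted: Sequence[int], truth: Sequence[int]
) -> tuple[float, float]:
    """Return (precision, sensitivity) for one binary category."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity


# Invented toy labels (NOT study data): each model finds every true
# positive but also flags some responses the human coder did not.
truth = [1, 0, 0, 1, 0, 0]
model_a = [1, 1, 0, 1, 1, 0]
model_b = [1, 0, 1, 1, 0, 0]
model_c = [1, 1, 0, 1, 0, 1]

ensemble = unanimous_vote([model_a, model_b, model_c])

print("model_a :", precision_and_sensitivity(model_a, truth))
print("ensemble:", precision_and_sensitivity(ensemble, truth))
```

In this toy case each model's false positives fall on different responses, so requiring agreement removes them while keeping the true positives. Real data will be less tidy: unanimity can also discard true positives that only some models detect, which is why the trade-off needs to be checked against human-coded ground truth.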