Scaling Open-Ended Survey Coding: An LLM Pipeline Where Definitions Do the Heavy Lifting
Published in Journal of Open Source Software, 2026
Recommended citation: Soria C. Scaling Open-Ended Survey Coding: An LLM Pipeline Where Definitions Do the Heavy Lifting. Journal of Open Source Software. 2026. doi:10.21105/joss.09678 https://doi.org/10.21105/joss.09678
CatLLM is an open-source Python and R toolkit for systematic, reproducible LLM-powered text classification. The package implements a provider-agnostic pipeline supporting frontier and open-weight models, with defaults calibrated against the consensus of double-blind coding by sociologists and demographers across multiple survey datasets. This short software paper documents the design, scope, and reproducibility guarantees of the toolkit; the full empirical validation is reported in a companion preprint under review at the Journal of Computational Social Science.
