CatLLM
Overview
CatLLM is an open-source Python and R ecosystem for systematic LLM-powered text classification. The cat-llm meta-package installs a family of domain-specific tools (for survey responses, social media, academic papers, political text, web content, and more) that all share the same classify() / extract() / summarize() API. Its classifications have been validated against expert human coders across multiple datasets.
The Ecosystem
Meta-package
cat-llm: The full ecosystem in one install (pip install cat-llm). Exposes every domain package through a single namespace (import catllm). [GitHub]
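As a quick sanity check after installing, the sketch below tests whether the catllm module named above can be found in the current environment. It uses only the Python standard library and does not call any CatLLM function; the helper name is_installed is illustrative and not part of the package.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # "catllm" is the import name given above for the cat-llm meta-package.
    if is_installed("catllm"):
        print("catllm is available")
    else:
        print("catllm not found; try: pip install cat-llm")
```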
General-purpose base
cat-stack: The domain-agnostic classification engine underlying every other package. Use it directly for general text, or build your own domain wrapper on top of it. [GitHub]
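The "domain wrapper" idea can be illustrated with a minimal sketch: a generic engine takes texts plus a category scheme, and a wrapper binds a fixed, domain-specific scheme to it. Everything here is hypothetical. The Engine signature, keyword_engine, and make_domain_classifier are stand-ins invented for illustration, and the toy engine matches keywords instead of calling an LLM; none of this is cat-stack's actual API.

```python
from typing import Callable, Dict, List

# Hypothetical engine signature: (texts, categories) -> one label per text.
Engine = Callable[[List[str], Dict[str, str]], List[str]]


def keyword_engine(texts: List[str], categories: Dict[str, str]) -> List[str]:
    """Toy stand-in for a domain-agnostic engine (no LLM involved).

    Assigns each text the first category whose name appears in it,
    or "other" when none match.
    """
    labels = []
    for text in texts:
        lowered = text.lower()
        match = next(
            (name for name in categories if name.lower() in lowered), "other"
        )
        labels.append(match)
    return labels


def make_domain_classifier(
    engine: Engine, categories: Dict[str, str]
) -> Callable[[List[str]], List[str]]:
    """Bind a fixed, domain-specific category scheme to a generic engine."""

    def classify(texts: List[str]) -> List[str]:
        return engine(texts, categories)

    return classify


# A hypothetical "support ticket" domain wrapper built on the toy engine.
ticket_categories = {
    "billing": "Questions about invoices, charges, or refunds.",
    "login": "Problems signing in or resetting a password.",
}
classify_tickets = make_domain_classifier(keyword_engine, ticket_categories)
print(classify_tickets(["My billing statement is wrong", "Cannot login", "Hello"]))
```

The wrapper only fixes the category scheme; swapping in a different engine leaves the domain-level function unchanged, which is the general appeal of layering domain packages over a shared base.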
Domain packages
cat-survey: Classify open-ended survey responses at scale. Verbose category definitions and ensemble voting handle ambiguity. [GitHub]
cat-pol: Classify political text — municipal ordinances, federal laws, executive orders, presidential speeches. Ships with 17 built-in datasets on HuggingFace, updated weekly. [GitHub]
cat-vader: Classify and analyze social media posts. Connects to the Threads API to pull your full post history, classify posts into custom categories, and return an enriched dataset with engagement metrics. [Learn more] [GitHub]
cat-ademic: Classify and summarize academic papers — abstracts, full texts, and research documents across disciplines. Built-in journal/field context. [GitHub]
cat-cog: Cognitive assessment scoring, including CERAD Constructional Praxis test evaluation for dementia research. [GitHub]
cat-web: Classify scraped web content — pages, articles, and HTML. Domain-tuned prompts for long-form online text. [GitHub]
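The ensemble voting mentioned for cat-survey can be pictured as several independent classifications of the same response being reduced to one label. The sketch below shows a plain majority vote as one possible rule; the source does not describe the voting rule CatLLM actually uses, and the function name majority_vote and the min_agreement threshold are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(votes: List[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the label chosen by more than `min_agreement` of the voters.

    Returns None when there are no votes or no label clears the threshold,
    which flags the response as ambiguous.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) > min_agreement else None


# Labels from three hypothetical model runs on the same survey response.
print(majority_vote(["cost", "cost", "quality"]))
print(majority_vote(["cost", "quality", "service"]))
```

In the first call two of three voters agree, so the shared label is returned; in the second no label has a majority, so the result is None and the response could be routed to a human coder.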
Related packages
llm-web-research: A separate package in the CatLLM family for LLM-powered web research that prioritizes accuracy over quantity. It uses a multi-step verification pipeline to catch ambiguous queries before they produce potentially incorrect answers. [GitHub]
Web Apps
Classify Survey Responses: A web-based tool for categorizing survey responses without writing code.
Citation
If you use CatLLM in your research, please cite:
Soria, C. (2026). Scaling Open-Ended Survey Coding: An LLM Pipeline Where Definitions Do the Heavy Lifting. Journal of Open Source Software. https://doi.org/10.21105/joss.09678
