New Preprint: Model Diversity Over Model Size — Unanimous LLM Ensembles Correct Over-Classification in Survey Coding

Published: June 04, 2026

I have a new preprint out on SocArXiv: Model Diversity Over Model Size: Unanimous LLM Ensembles Correct Over-Classification in Survey Coding. It follows up on one finding flagged in the CatLLM methods paper, that unanimous multi-model ensembling corrects over-classification, and asks which ensemble ingredients actually drive the gain and where the gain shows up.

Abstract

Large language models are increasingly used to classify open-ended survey responses, but they systematically over-classify, assigning categories too liberally on ambiguous cases and producing high sensitivity but low precision. Drawing on the established principle that aggregating multiple noisy annotators outperforms any single annotator, we test whether ensembles of LLMs can correct this problem. Using four open-ended survey questions with human-coded ground truth (3,208 responses, 6 categories per question), we evaluate ensemble configurations across 16 models spanning three cost tiers and six providers.

Key Findings

Unanimous voting fixes the over-classification problem. On the most ambiguous categories, the false positive rate drops from 50% to 3% under unanimous agreement, and precision triples.
The gain concentrates where over-classification is worst. Subjectively ambiguous categories with fuzzy boundaries see large improvements; categories with clear criteria show no benefit. The pattern replicates across three independent datasets (UCNets, GoEmotions, the British Election Study).
Cross-provider model diversity is the active ingredient. Models from different providers make different errors on ambiguous cases, and consensus filters the idiosyncratic false positives. Temperature variation and within-family size scaling contribute essentially nothing.
Three cheap models can beat one expensive one. As few as three diverse lower-tier models suffice to reliably exceed GPT-5 on the ambiguous classification tasks where this matters most.
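
To make the unanimous-voting rule concrete, here is a minimal sketch in Python. It is an illustration only: the labels are made-up toy data (not results from the preprint), and the function names `unanimous_vote` and `precision_and_fpr` are hypothetical helpers, not part of CatLLM's API. Each model emits a 0/1 decision per response for a single category, and the ensemble assigns the category only when every model agrees.

```python
from typing import Sequence


def unanimous_vote(votes: Sequence[Sequence[int]]) -> list[int]:
    """Assign the category to a response only if every model assigned it.

    `votes` holds one list of 0/1 labels per model, all of equal length.
    """
    return [int(all(per_response)) for per_response in zip(*votes)]


def precision_and_fpr(pred: Sequence[int], truth: Sequence[int]) -> tuple[float, float]:
    """Return (precision, false positive rate) for binary labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return precision, fpr


# Toy data (assumed for illustration): human codes for eight responses,
# plus three models that each over-classify on different responses.
truth = [1, 1, 0, 0, 0, 0, 1, 0]
model_a = [1, 1, 1, 0, 0, 1, 1, 0]
model_b = [1, 1, 0, 1, 0, 0, 1, 0]
model_c = [1, 1, 0, 0, 1, 0, 1, 0]

ensemble = unanimous_vote([model_a, model_b, model_c])

single_precision, single_fpr = precision_and_fpr(model_a, truth)
ensemble_precision, ensemble_fpr = precision_and_fpr(ensemble, truth)

print("model_a alone:", single_precision, single_fpr)
print("unanimous ensemble:", ensemble_precision, ensemble_fpr)
```

In this toy setup each model's false positives fall on different responses, so requiring agreement removes them. A unanimous rule can only drop positive calls, never add them, so it is worth checking sensitivity alongside precision on real data.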

Implications

For the ambiguous classification problems common in open-ended survey research, the well-established annotation principle of multi-coder agreement transfers directly to LLMs: investing in diverse perspectives is more effective than investing in a single expensive model. Practically, this means survey researchers can often skip the frontier-tier subscription and instead orchestrate a small cross-provider ensemble, frequently at a fraction of the cost.

Read the full preprint on SocArXiv · Replication materials · CatLLM (the toolkit used to produce the classifications)
