New Preprint: An Empirical Investigation into the Utility of Large Language Models in Open-Ended Survey Data Categorization

Published:

I have a new preprint out on SocArXiv: An Empirical Investigation into the Utility of Large Language Models in Open-Ended Survey Data Categorization.

Abstract

How effectively can Large Language Models (LLMs) approximate social scientist judgment in categorizing open-ended survey responses? This study compares eight contemporary LLMs—GPT-5, Claude Sonnet 4.5, Gemini 2.5 Flash, Grok 4 Fast, Qwen 3, DeepSeek v3.1, Llama 4, and Mistral Medium—to human annotators on 3,208 responses from the UC Berkeley Social Networks Study, spanning four question types and yielding 19,248 multi-label coding decisions.

Key Findings

  • Models do not reach human-like inter-rater reliability in comparisons with individual coders (Krippendorff’s alpha 0.58–0.59 vs. 0.77 for humans), yet they achieve high accuracy rates of 82–97% relative to a human consensus standard, depending on task complexity.
  • Accuracy declines for longer, more ambiguous responses and for rare thematic categories, indicating that model performance is sensitive to both response length and category prevalence.
  • Demographic differences in performance are present (for example, responses from female respondents are classified less accurately), but much of this gap is associated with differences in response style, such as length and complexity, rather than being clearly attributable to respondent demographics as such.
  • These performance patterns shift the narrative that emerges from the coded data, which could lead qualitative researchers to reach different substantive conclusions: models miss sociologically meaningful patterns in how some groups, such as Black respondents or women, describe the formation of friendships.
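
To make the first finding more concrete, here is a minimal Python sketch of the two kinds of metric it contrasts: Krippendorff's alpha between coders, and accuracy against a human consensus label. It is not the preprint's code. The function names, the toy data, and the majority-vote consensus rule are all assumptions for illustration, and the alpha implementation only covers nominal labels with no special handling of missing ratings.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of lists; each inner list holds the labels that the
    coders assigned to one coding decision. Units with fewer than two
    ratings are skipped because they carry no agreement information.
    """
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Every ordered pair of ratings from different coders counts 1/(m-1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label was ever used; alpha is undefined
    return 1.0 - observed / expected


def consensus_accuracy(human_units, model_labels):
    """Share of model labels that match the human majority label.

    Majority vote is an assumed consensus rule; ties are not handled
    specially, so use an odd number of human coders.
    """
    hits = 0
    for ratings, predicted in zip(human_units, model_labels):
        consensus = Counter(ratings).most_common(1)[0][0]
        hits += int(predicted == consensus)
    return hits / len(model_labels)


# Toy data (hypothetical): three human coders, eight binary coding decisions.
human = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 0, 0],
         [0, 0, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
model = [1, 1, 0, 0, 0, 0, 0, 0]

# Pair the model with one individual coder (the third) to compute alpha.
model_vs_coder = [[m, ratings[2]] for m, ratings in zip(model, human)]

print("accuracy vs. consensus:", consensus_accuracy(human, model))
print("alpha, model vs. one coder:", krippendorff_alpha_nominal(model_vs_coder))
print("alpha, among human coders:", krippendorff_alpha_nominal(human))
```

The two numbers answer different questions. Accuracy counts raw matches with the consensus label, while alpha corrects for the agreement expected by chance given how often each label is used, so a model can score well on the former and still fall short on the latter.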

Implications

Overall, the results suggest that LLMs can substantially reduce coding burden but are best used to augment, rather than replace, human judgment when researchers seek to detect subtle, demographically inflected social patterns in qualitative data.

Read the full preprint on SocArXiv