<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://christophersoria.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://christophersoria.com/" rel="alternate" type="text/html" /><updated>2026-03-23T09:41:11-07:00</updated><id>https://christophersoria.com/feed.xml</id><title type="html">Chris Soria</title><subtitle>PhD Candidate at UC Berkeley, Demography</subtitle><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><entry><title type="html">What California Cities Actually Legislate: Classifying Municipal Ordinances with CatLLM</title><link href="https://christophersoria.com/posts/2026/03/catpol-ordinance-analysis/" rel="alternate" type="text/html" title="What California Cities Actually Legislate: Classifying Municipal Ordinances with CatLLM" /><published>2026-03-22T00:00:00-07:00</published><updated>2026-03-22T00:00:00-07:00</updated><id>https://christophersoria.com/posts/2026/03/catpol-ordinance-analysis</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/catpol-ordinance-analysis/"><![CDATA[<p><img src="/images/catpol-ordinance-banner.png" alt="" /></p>

<p>Local laws shape daily life in ways that most people never see — your rent, your commute, what gets built on the corner lot — but they’re written in dense legal language, buried in city clerk archives, and produced at a volume no individual can keep up with. San Diego alone has passed nearly 90,000 ordinances and resolutions. San Francisco adds dozens per month. Journalists cover the headline votes; researchers study federal legislation; but the vast majority of municipal lawmaking happens without any systematic analysis at all.</p>

<p><strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong> is designed to close that gap. It’s an open-source Python package that pulls municipal ordinances, federal laws, executive orders, and political speech directly from public datasets, then uses LLMs to classify, summarize, and analyze them at scale. It can take a 15,000-word ordinance written in statutory language and tell you, in plain English, what it does, who it affects, and where it falls on the political spectrum.</p>

<p>In this post, I used <a href="https://github.com/chrissoria/cat-llm">cat-llm</a> to classify 200 recent ordinances each from <strong>San Diego</strong> and <strong>San Francisco</strong>, two major California cities with different political characters, against two classification schemes: a 12-category policy taxonomy and a 3-category political lean assessment. The goal: a quantitative snapshot of what these cities legislate about and whether the ideological differences between them show up in the text of their laws.</p>

<p><em>Want to run this on your own data? Skip to the <a href="#how-to-run-it-yourself">methodology and replication section</a>. The entire pipeline is open source and the datasets are public.</em></p>

<hr />

<h2 id="background-cat-pol-and-the-data">Background: cat-pol and the Data</h2>

<p><strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong> is an ecosystem of open-source Python packages that use LLMs to classify text at scale. Users interested specifically in political text analysis can install <strong><a href="https://pypi.org/project/cat-pol/">cat-pol</a></strong> (<code class="language-plaintext highlighter-rouge">pip install cat-pol</code>), which ships with 16 built-in political data sources on HuggingFace, including municipal ordinances from 12 California cities and counties, federal public laws, executive orders, presidential speeches, and Trump’s Truth Social posts, all accessible with a single <code class="language-plaintext highlighter-rouge">source=</code> parameter.</p>

<p>For this analysis, the data comes from two HuggingFace datasets:</p>
<ul>
  <li><strong><a href="https://huggingface.co/datasets/chrissoria/san-diego-ordinances">chrissoria/san-diego-ordinances</a></strong> — 87,983 records (ordinances + resolutions) going back to 1905</li>
  <li><strong><a href="https://huggingface.co/datasets/chrissoria/sf-ordinances">chrissoria/sf-ordinances</a></strong> — 4,048 ordinances going back to 2011</li>
</ul>

<p>Both datasets are scraped from official city clerk systems, include full ordinance text extracted from PDFs, and are updated weekly via automated scrapers. I took the 200 most recent ordinances with text from each city.</p>

<p>What follows is a demonstration of what you can learn from municipal legislation using a few lines of Python and no specialized legal knowledge. The entire analysis can be reproduced for free: the data is publicly hosted on <a href="https://huggingface.co/chrissoria">HuggingFace</a>, cat-llm is <a href="https://github.com/chrissoria/cat-llm">open source</a>, and the classification can run on free-tier HuggingFace models or local models via <a href="https://ollama.com">Ollama</a> with zero API costs. If you can write <code class="language-plaintext highlighter-rouge">pip install cat-llm</code>, you can replicate and extend every finding below.</p>
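
<p>As a concrete starting point, here is a minimal sketch of that data pull using the <code class="language-plaintext highlighter-rouge">fetch_policy_source()</code> call shown later in this post. It needs no API key; note that the <code class="language-plaintext highlighter-rouge">text</code> and <code class="language-plaintext highlighter-rouge">date</code> column names are my assumptions about the dataset schema, so check the returned DataFrame on your own pull.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm

# Pull recent San Diego ordinances straight from the HuggingFace dataset (no API key needed)
sd = catllm.fetch_policy_source("city_san_diego", n=500, doc_type="ordinance")

# Keep the 200 most recent rows that actually contain extracted text
# ("text" and "date" are assumed column names -- inspect sd.columns on your pull)
sd = sd.dropna(subset=["text"])
sd = sd[sd["text"].str.strip() != ""]
sd_recent = sd.sort_values("date", ascending=False).head(200)
</code></pre></div></div>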

<hr />

<h2 id="san-francisco-passes-twice-as-many-laws">San Francisco Passes Twice as Many Laws</h2>

<p>Before asking <em>what</em> these cities legislate about, it’s worth asking <em>how much</em> they legislate.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>San Diego</th>
      <th>San Francisco</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Ordinances per month (2020–2025 avg)</td>
      <td>11.2</td>
      <td><strong>22.2</strong></td>
    </tr>
    <tr>
      <td>Ratio</td>
      <td> </td>
      <td><strong>2.0x</strong></td>
    </tr>
  </tbody>
</table>

<p>San Francisco passes roughly <strong>twice as many ordinances per month</strong> as San Diego: 22 vs 11. The pattern is stable across years, not a spike. SF consistently produces 20–25 ordinances per month; SD produces 9–15.</p>

<p>San Diego compensates with a high volume of <em>resolutions</em> (about 49 per month), which are administrative actions (contract approvals, budget items, proclamations) rather than new law. But in terms of actual lawmaking (text that creates, amends, or repeals municipal code), SF is substantially more productive.</p>

<p>This matters for everything that follows. When we compare the <em>share</em> of ordinances in each policy domain, we’re comparing slices of very different pies. A 24% business regulation rate in SF means roughly 5 new business ordinances per month. An 8% rate in SD means less than 1. Throughout the rest of this analysis, I’ll show both percentages and estimated monthly counts to keep the denominators honest.</p>
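
<p>If you replicate this, the conversion is just category share times monthly ordinance volume. A small sketch using the headline numbers from the table above (the function and variable names are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Convert a category share of ordinances into an estimated monthly count.
# Illustrative inputs: SF averages ~22.2 ordinances/month, SD ~11.2 (2020-2025).
def monthly_count(share, ordinances_per_month):
    return share * ordinances_per_month

sf_business = monthly_count(0.24, 22.2)  # roughly 5.3 business ordinances per month in SF
sd_business = monthly_count(0.08, 11.2)  # roughly 0.9 in SD
print(round(sf_business, 1), round(sd_business, 1))
</code></pre></div></div>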

<hr />

<h2 id="what-do-these-cities-legislate-about">What Do These Cities Legislate About?</h2>

<p>Rather than imposing categories top-down, I used cat-llm’s extract function to discover 12 policy domains directly from the ordinance text. Categories are multi-label, so a single ordinance can be tagged with multiple domains. See the <a href="#methodology-notes">methodology section</a> for how the categories were generated and the full list.</p>

<h3 id="results-policy-domain-distribution">Results: Policy Domain Distribution</h3>

<p><img src="/images/catpol-policy-distribution.png" alt="" /></p>

<p><img src="/images/catpol-policy-gap.png" alt="" /></p>

<p><em>200 most recent ordinances per city. Multi-label (rows sum to &gt;100%). Model: Qwen 2.5-72B-Instruct.</em></p>

<h3 id="what-the-numbers-mean">What the Numbers Mean</h3>

<p>The two cities legislate about fundamentally different things.</p>

<p><strong>San Diego is a building city.</strong> Nearly half (43%) of its recent ordinances touch infrastructure and public works: road improvements, water main replacements, sewer projects, construction contract extensions. Even though SD passes half as many ordinances overall, it still produces <strong>more infrastructure legislation per month in absolute terms</strong> than SF (4.8 vs 3.3). This isn’t just proportional: SD is genuinely more active on physical infrastructure than SF is. Parks come close to the same pattern (1.8 vs 2.3 per month, near parity despite SD’s half-sized legislative output). SD’s legislative agenda reads like a city physically constructing and maintaining itself.</p>

<p><strong>San Francisco is a regulating city.</strong> Its ordinances spread across a wider range of policy domains, with no single category exceeding 27%. The biggest concentrations are in zoning and land use (6.0/month), revenue and financing (5.6/month), and business regulation (5.4/month). Health and social services, a category that barely registers in San Diego at 0.7 ordinances per month, hits <strong>5.3 per month</strong> in SF. That’s not a rounding difference; it’s a 7x gap. Housing policy runs at roughly triple SD’s rate (3.7 vs 1.3/month). San Francisco’s legislative output reads like a city managing social complexity: who can build what, under what conditions, with what protections for whom.</p>

<p>The environmental protection rate is nearly identical between the two cities (~17%), suggesting this is a baseline concern for California municipalities regardless of political character. Tax increases are rare in both (under 3%), which likely reflects the political difficulty of explicit tax votes at the local level.</p>

<blockquote>
  <p><strong>SF legislates more about business in <em>both</em> directions.</strong> San Francisco scores higher than San Diego on business regulation (24% vs 8%) <em>and</em> pro-business/economic development (17% vs 5%). It creates more rules and more incentives. San Diego’s approach to business is to leave it alone. The legislative silence is itself a policy choice.</p>
</blockquote>

<blockquote>
  <p><strong>SF actively raises revenue to fund social programs.</strong> SF generates 25% of its ordinances around revenue and financing, compared to SD’s 16%. Combined with SF’s 24% rate on health and social services (vs SD’s 6%), the picture is a city that raises money to fund an interventionist social agenda. San Diego raises less and spends what it raises on concrete: roads, pipes, parks.</p>
</blockquote>

<p>Put differently: <strong>San Diego legislates like a city that wants to run well. San Francisco legislates like a city that wants to do good.</strong> Whether “doing good” through regulation and social programs actually produces better outcomes is a separate empirical question, but the legislative priorities are unmistakable in the data. SD’s council spends its time keeping the lights on and the water flowing. SF’s council spends its time deciding who gets housing protections, which businesses need new permits, and how to fund homelessness services.</p>

<p>As a robustness check, I re-ran the analysis on 1,000 ordinances per city using GPT-4o with a different, data-driven category scheme. Different model, five times the sample, looser categories. Same conclusion: SD dominates on infrastructure and construction; SF dominates on health, environment, and housing. The pattern holds.</p>
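
<p>The robustness run is the same call with a larger <code class="language-plaintext highlighter-rouge">n</code> and a different model. A sketch of what that looks like; how the <code class="language-plaintext highlighter-rouge">gpt-4o</code> model name maps to a provider is an assumption here, so consult the cat-llm docs if you need to set <code class="language-plaintext highlighter-rouge">model_source</code> explicitly.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm

# Robustness check: larger sample, different model family.
# (Provider handling for "gpt-4o" is an assumption -- see the cat-llm docs.)
robust = catllm.classify_policy(
    source="city_san_diego",
    categories=["Infrastructure and Public Works", "Health and Social Services"],  # trimmed for brevity
    doc_type="ordinance",
    n=1000,
    api_key="your-openai-key",
    user_model="gpt-4o",
)
</code></pre></div></div>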

<hr />

<h2 id="part-2-do-ordinances-have-a-political-lean">Part 2: Do Ordinances Have a Political Lean?</h2>

<h3 id="category-setup">Category Setup</h3>

<p>For the political lean analysis, I used three categories designed to capture ideological orientation:</p>

<ol>
  <li>
    <p><strong>Conservative/Right-Leaning Policy</strong> — deregulation, tax cuts, pro-business measures, law enforcement expansion, property rights protections, reduced government intervention, privatization of services</p>
  </li>
  <li>
    <p><strong>Progressive/Left-Leaning Policy</strong> — new regulations, tax increases, tenant protections, environmental mandates, social services expansion, equity/inclusion initiatives, labor protections, police reform</p>
  </li>
  <li>
    <p><strong>Neutral</strong> — routine contract approvals, procedural amendments, election scheduling, civil service appointments, budget housekeeping with no policy direction</p>
  </li>
</ol>

<p>This classification used <strong>Qwen3-235B</strong>, Qwen’s flagship thinking model, to see whether reasoning capability helps with the more nuanced task of ideological classification.</p>
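
<p>A sketch of that call, for reference. The source identifier for San Francisco and the exact HuggingFace model id for Qwen3-235B are assumptions here; run <code class="language-plaintext highlighter-rouge">catllm.list_policy_sources()</code> and substitute whichever Qwen3-235B variant you have access to.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm

# Political lean classification (source id and model id are assumptions -- see lead-in)
lean = catllm.classify_policy(
    source="city_san_francisco",
    categories=[
        "Conservative/Right-Leaning Policy",
        "Progressive/Left-Leaning Policy",
        "Neutral",
    ],
    doc_type="ordinance",
    n=200,
    api_key="your-hf-token",
    user_model="Qwen/Qwen3-235B-A22B",
    model_source="huggingface",
)
</code></pre></div></div>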

<h3 id="results-political-lean-distribution">Results: Political Lean Distribution</h3>

<p><img src="/images/catpol-political-lean.png" alt="" /></p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>San Diego</th>
      <th>San Francisco</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Conservative/Right-Leaning</td>
      <td>2.6%</td>
      <td><strong>23.1%</strong></td>
    </tr>
    <tr>
      <td>Progressive/Left-Leaning</td>
      <td>19.9%</td>
      <td><strong>31.8%</strong></td>
    </tr>
    <tr>
      <td>Neutral</td>
      <td><strong>77.6%</strong></td>
      <td>45.6%</td>
    </tr>
  </tbody>
</table>

<p><em>200 most recent ordinances per city. Multi-label classification. Model: Qwen3-235B (thinking model).</em></p>

<p>The results are counterintuitive, and that’s what makes them interesting.</p>

<p><strong>San Diego’s ordinances are overwhelmingly neutral.</strong> Nearly 78% of SD’s recent ordinances carry no detectable ideological lean. These are contract extensions, budget transfers, construction authorizations. The machinery of a city that governs by administration rather than ideology. Only 3% code as conservative and 20% as progressive.</p>

<p><strong>San Francisco legislates politically in both directions.</strong> SF has a higher progressive rate (32% vs 20%), which aligns with expectations. But the surprise is SF’s conservative rate: <strong>23% vs SD’s 3%</strong>. San Francisco’s ordinances code conservative at nearly nine times San Diego’s rate, and since SF passes twice as many ordinances overall, the gap in absolute monthly counts is even larger.</p>

<p>This isn’t because San Francisco has a secret conservative agenda. It’s because SF <em>actively legislates</em> about the domains that register as ideological: business regulation, tax policy, development incentives, policing. When you pass an ordinance streamlining permits for small businesses, that codes as pro-business/conservative. When you pass an ordinance adding tenant protections, that codes as progressive. San Diego doesn’t pass either ordinance. It passes a contract amendment instead.</p>

<p>The neutral rate tells the real story. SD’s 78% neutral rate means its city council spends most of its time on administrative governance. SF’s 46% neutral rate means less than half of its legislative output is purely procedural. The majority of SF ordinances take a policy stance of some kind. <strong>SF doesn’t just lean left — it legislates ideologically, period.</strong> It takes more stances in more directions than San Diego takes in any direction.</p>

<hr />

<h2 id="methodology-notes">Methodology Notes</h2>

<p>A few important details about how this analysis works and where it might break.</p>

<p><strong>Multi-label classification.</strong> Each ordinance can be tagged with multiple categories simultaneously. An infrastructure bond gets both “Infrastructure” and “Revenue and Financing.” This means percentages sum to more than 100%. That’s by design, not a bug.</p>
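
<p>Concretely, the percentages reported above are column means over 0/1 category indicators, which is why they can sum past 100%. A sketch, assuming the classification output carries one binary column per category (column names illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

def category_shares(results: pd.DataFrame, category_cols: list[str]) -> pd.Series:
    """Percent of rows tagged with each category; multi-label, so the total can exceed 100."""
    return results[category_cols].mean() * 100

# Example with the DataFrame returned by catllm.classify_policy (binary 0/1 columns assumed):
# print(category_shares(results, ["Infrastructure and Public Works", "Revenue and Financing"]))
</code></pre></div></div>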

<p><strong>Model choice.</strong> I used two models: <strong>Qwen 2.5-72B-Instruct</strong> for the 12-category policy domain classification and <strong>Qwen3-235B</strong> (a 235-billion parameter mixture-of-experts thinking model) for the political lean analysis. Both are open-source models accessed via HuggingFace’s inference API, no OpenAI dependency. The robustness check used GPT-4o on a larger sample to confirm the results hold across model families.</p>

<p><strong>Category discovery.</strong> Categories weren’t hand-picked. I used <code class="language-plaintext highlighter-rouge">catllm.extract_policy()</code> to sample 50 ordinances and let the LLM discover recurring themes, then semantically merged duplicates into a clean taxonomy. The <code class="language-plaintext highlighter-rouge">specificity="specific"</code> parameter ensures category names include examples, which significantly improves classification accuracy over bare labels. The 12 policy domain categories used:</p>

<ol>
  <li>Tax Increases</li>
  <li>Revenue and Financing</li>
  <li>Budget and Appropriations</li>
  <li>Housing and Residential Development</li>
  <li>Zoning and Land Use Changes</li>
  <li>Infrastructure and Public Works</li>
  <li>Business Regulation</li>
  <li>Pro-Business and Economic Development</li>
  <li>Environmental Protection</li>
  <li>Public Safety</li>
  <li>Health and Social Services</li>
  <li>Parks, Recreation, and Culture</li>
</ol>

<p>For the political lean analysis, three categories: Conservative/Right-Leaning Policy, Progressive/Left-Leaning Policy, and Neutral.</p>

<p><strong>Limitations.</strong> There’s no ground truth here — no human-coded comparison set for municipal ordinances. The classifications reflect what the model <em>thinks</em> these ordinances are about, not an objective standard. Model bias is a real concern, especially for the political lean analysis; LLMs have known tendencies in how they interpret political language. The sample sizes (200 for the Qwen runs, 1,000 for the GPT-4o robustness check) are reasonable but not exhaustive. The full classified datasets (1,700 SD ordinances, 3,900 SF ordinances) are now public on HuggingFace for anyone who wants to validate or extend this analysis.</p>

<hr />

<h2 id="the-public-datasets">The Public Datasets</h2>

<p>All data and classification results are publicly available:</p>

<ul>
  <li><strong>San Diego ordinances</strong>: <a href="https://huggingface.co/datasets/chrissoria/san-diego-ordinances">chrissoria/san-diego-ordinances</a> (87,983 records)</li>
  <li><strong>San Francisco ordinances</strong>: <a href="https://huggingface.co/datasets/chrissoria/sf-ordinances">chrissoria/sf-ordinances</a> (4,048 records)</li>
</ul>

<p>The source registry includes 16 datasets across California cities and counties, federal legislation, and social media:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span>

<span class="n">catllm</span><span class="p">.</span><span class="n">list_policy_sources</span><span class="p">()</span>  <span class="c1"># see all 16 sources
</span></code></pre></div></div>

<hr />

<h2 id="how-to-run-it-yourself">How to Run It Yourself</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cat-llm          <span class="c"># full ecosystem</span>
<span class="c"># or: pip install cat-pol    # just the political text package</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span>

<span class="c1"># Discover categories from your data
</span><span class="n">categories</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">extract_policy</span><span class="p">(</span>
    <span class="n">source</span><span class="o">=</span><span class="s">"city_san_diego"</span><span class="p">,</span>
    <span class="n">doc_type</span><span class="o">=</span><span class="s">"ordinance"</span><span class="p">,</span>
    <span class="n">n</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"Qwen/Qwen2.5-72B-Instruct"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">max_categories</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span>
    <span class="n">specificity</span><span class="o">=</span><span class="s">"specific"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Classify using discovered categories
</span><span class="n">results</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">classify_policy</span><span class="p">(</span>
    <span class="n">source</span><span class="o">=</span><span class="s">"city_san_diego"</span><span class="p">,</span>
    <span class="n">categories</span><span class="o">=</span><span class="n">categories</span><span class="p">[</span><span class="s">"top_categories"</span><span class="p">],</span>
    <span class="n">doc_type</span><span class="o">=</span><span class="s">"ordinance"</span><span class="p">,</span>
    <span class="n">n</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"Qwen/Qwen2.5-72B-Instruct"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Summarize in plain language
</span><span class="n">summaries</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">summarize_policy</span><span class="p">(</span>
    <span class="n">source</span><span class="o">=</span><span class="s">"city_san_diego"</span><span class="p">,</span>
    <span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
    <span class="nb">format</span><span class="o">=</span><span class="s">"bullets"</span><span class="p">,</span>
    <span class="n">tone</span><span class="o">=</span><span class="s">"eli5"</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-key"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Optimize prompts with user feedback
</span><span class="n">result</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">prompt_tune_policy</span><span class="p">(</span>
    <span class="n">source</span><span class="o">=</span><span class="s">"city_san_diego"</span><span class="p">,</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Housing"</span><span class="p">,</span> <span class="s">"Public Safety"</span><span class="p">,</span> <span class="s">"Finance"</span><span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-key"</span><span class="p">,</span>
    <span class="n">sample_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Fetch raw data (no API key needed)
</span><span class="n">df</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">fetch_policy_source</span><span class="p">(</span><span class="s">"city_san_diego"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">doc_type</span><span class="o">=</span><span class="s">"ordinance"</span><span class="p">)</span>
</code></pre></div></div>

<p>cat-llm handles the data pull, classification, and output in a consistent pipeline. Every dataset in the registry uses the same <code class="language-plaintext highlighter-rouge">source=</code> parameter across all six functions (<code class="language-plaintext highlighter-rouge">classify_policy</code>, <code class="language-plaintext highlighter-rouge">extract_policy</code>, <code class="language-plaintext highlighter-rouge">explore_policy</code>, <code class="language-plaintext highlighter-rouge">summarize_policy</code>, <code class="language-plaintext highlighter-rouge">prompt_tune_policy</code>, <code class="language-plaintext highlighter-rouge">fetch_policy_source</code>).</p>

<hr />

<h2 id="what-this-tells-us">What This Tells Us</h2>

<p>The consistent finding across every analysis (different models, different sample sizes, different category schemes) is that San Diego and San Francisco govern in fundamentally different modes.</p>

<p>San Diego governs through <strong>administration</strong>: infrastructure projects, contract management, budget operations. Its ordinances are largely neutral, procedural, and focused on keeping the physical city running. When SD does legislate on policy, it tilts modestly progressive, but most of its legislative energy goes to the apolitical work of urban maintenance.</p>

<p>San Francisco governs through <strong>policy</strong>: zoning, business regulation, health services, housing, environmental mandates. Its ordinances are more likely to carry an ideological valence, both progressive <em>and</em> conservative, because SF actively codifies political values into law. It produces twice as many ordinances per month, and more than half of them take a policy stance.</p>

<p>Neither mode is inherently better. SD’s approach keeps government lean and focused; SF’s approach uses legislation as a tool for social intervention. But the data makes clear that these aren’t just different political leanings. They’re different <em>theories of what city government is for</em>.</p>

<h3 id="whats-next">What’s Next</h3>

<p>This analysis covers two cities. cat-pol ships with 16 data sources (and growing): 12 California cities, San Diego County, federal public laws, executive orders, presidential speeches, and Trump’s Truth Social posts. The same classification pipeline can be applied to any of them. Some questions I haven’t answered:</p>

<ul>
  <li><strong>Time series</strong>: Has SF always been this regulatory, or did it shift after a particular election?</li>
  <li><strong>City-size effects</strong>: Do smaller cities (Salinas, Clovis) look more like SD or SF?</li>
  <li><strong>County vs city</strong>: SD County just went live as a dataset — does county governance look different from city governance in the same jurisdiction?</li>
  <li><strong>Federal comparison</strong>: How do municipal policy domains map onto federal legislation?</li>
</ul>

<p>If you build something with cat-pol or the datasets, reach out at <a href="mailto:chrissoria@berkeley.edu">chrissoria@berkeley.edu</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="LLM" /><category term="political science" /><category term="municipal policy" /><category term="cat-pol" /><category term="NLP" /><category term="open source" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Upcoming Presentations and Conferences (Summer–Fall 2026)</title><link href="https://christophersoria.com/posts/2026/03/upcoming-presentations-2026/" rel="alternate" type="text/html" title="Upcoming Presentations and Conferences (Summer–Fall 2026)" /><published>2026-03-13T00:00:00-07:00</published><updated>2026-03-13T00:00:00-07:00</updated><id>https://christophersoria.com/posts/2026/03/upcoming-presentations-2026</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/upcoming-presentations-2026/"><![CDATA[<p>I have a busy conference season ahead. Here is where I’ll be presenting and what I’ll be talking about.</p>

<p><strong>PAA 2026</strong> — St. Louis, May 9</p>

<p><strong>AAPOR 2026</strong> — Los Angeles, May 13. Oral presentation on the <a href="/catllm/">CatLLM</a> pipeline for LLM-based survey response classification.</p>

<p><strong>EdDem Workshop</strong> — UW-Madison, May 14–15</p>

<p><strong>Sunbelt 2026</strong> — Daytona Beach, June 22–28. Oral presentation on social isolation, loneliness, and cognitive decline.</p>

<p><strong>ASA 2026</strong> — New York, August 7–11 <em>(pending)</em></p>

<p><strong>ΨMCA 2026</strong> — August 9–14. Presentation on algorithmic classification in dementia research.</p>

<p><strong>USC Gateway Brownbag</strong> — Fall semester</p>

<p><strong>GSA 2026</strong> — National Harbor, MD, November 4–7 <em>(pending)</em></p>

<p>The presentations span all three of my research streams: networks and cognitive aging (Sunbelt), computational methods for survey research (AAPOR), and applying algorithmic classification to dementia research (ΨMCA). If you’ll be at any of these, I’d love to connect.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="Social Networks" /><category term="Cognitive Aging" /><category term="Large Language Models" /><category term="Dementia" /><category term="Conferences" /><summary type="html"><![CDATA[I have a busy conference season ahead. Here is where I’ll be presenting and what I’ll be talking about.]]></summary></entry><entry><title type="html">What Bluesky’s Most-Followed Accounts Actually Post About</title><link href="https://christophersoria.com/posts/2026/03/catvader-bluesky-analysis/" rel="alternate" type="text/html" title="What Bluesky’s Most-Followed Accounts Actually Post About" /><published>2026-03-05T00:00:00-08:00</published><updated>2026-03-05T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/03/catvader-bluesky-analysis</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/catvader-bluesky-analysis/"><![CDATA[<p><img src="/images/catvader-bluesky-banner.png" alt="" /></p>

<audio controls="" style="width:100%">
  <source src="https://huggingface.co/datasets/chrissoria/blog-audio/resolve/main/catvader-bluesky-analysis.mp3" type="audio/mpeg" />
</audio>
<!-- Audio generated with edge_tts (en-US-BrianNeural) via convert_blog_to_audio.py in the repo root.
     To regenerate: cd chrissoria.github.io && python3 convert_blog_to_audio.py
     Hosted on HuggingFace: https://huggingface.co/datasets/chrissoria/blog-audio -->

<p>A few days ago I used <strong><a href="https://pypi.org/project/cat-vader/">cat-vader</a></strong> to analyze my own Threads feed — 582 posts, 9 categories, and a slightly uncomfortable amount of self-reflection. That analysis asked a personal question: what do <em>I</em> post about?</p>

<p>This one asks a broader question. I took the same classification pipeline and pointed it at ten of Bluesky’s most-followed accounts: AOC, Mark Cuban, Mark Hamill, The Onion, George Takei, The New York Times, Rachel Maddow, Stephen King, MeidasTouch, and NPR. These accounts span the ecosystem — a sitting congresswoman, a billionaire, a pop culture icon, a satirical news outlet, two major media organizations, an advocacy network, and a horror novelist. 250 posts per account, 2,500 posts total, all classified by GPT-4o-mini against a nine-category scheme.</p>

<p>What do Bluesky’s most-followed accounts post about? And more interestingly: once you control for <em>who</em> is posting, does content type actually predict engagement?</p>

<hr />

<h2 id="background-cat-vader-and-the-dataset">Background: cat-vader and the Dataset</h2>

<p><strong><a href="https://github.com/chrissoria/cat-vader">cat-vader</a></strong> is a fork of my open-source survey classification package <strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong>, adapted for social media analysis. The core idea is simple: you give it a list of posts and a set of categories with descriptions, and it uses an LLM to check each post against each category independently. Because categories aren’t mutually exclusive, a single post can belong to multiple categories simultaneously — a Rachel Maddow post about deportation policy might be tagged as both Politics &amp; Elections and Social Issues &amp; Justice. That multi-label design is what makes the output useful for analysis rather than just binning.</p>

<p>For this analysis I pulled 250 posts from each of ten accounts using cat-vader’s <code class="language-plaintext highlighter-rouge">sm_source="bluesky"</code> integration, classified them all using GPT-4o-mini, and exported everything to a single CSV that’s now publicly available on Hugging Face at <strong><a href="https://huggingface.co/datasets/chrissoria/bluesky-top10-classified">chrissoria/bluesky-top10-classified</a></strong>.</p>

<p>One important note before diving in: <strong>Bluesky’s API returns <code class="language-plaintext highlighter-rouge">views = 0</code> for all posts.</strong> The platform either doesn’t track impression counts or doesn’t expose them. That means the engagement analysis here focuses entirely on <strong>likes and replies</strong> — two signals that are real and public, but not the same as reach.</p>

<hr />

<h2 id="category-setup">Category Setup</h2>

<p>Before running classification, I defined nine categories designed to capture the thematic range of these accounts. Rather than bare labels, each category gets a description that guides the model on borderline cases.</p>

<p><strong>1. Politics &amp; Elections</strong> — Posts about electoral dynamics, political parties, voting, candidates, or partisan maneuvering. Includes commentary on legislative proceedings, election results, and the behavior of political figures as political actors.</p>

<blockquote>
  <p><em>“I honestly believe our most powerful position in a toxic time that feeds on cynicism, apathy, &amp; despair is to genuinely care and act for a better world.”</em> — Alexandria Ocasio-Cortez (146,230 likes)</p>
</blockquote>

<p><strong>2. Trump &amp; MAGA Criticism</strong> — Posts directly targeting Donald Trump, his administration, his supporters, or the MAGA movement. Includes both policy critiques and character commentary.</p>

<blockquote>
  <p><em>“Trump Suffers Setback Unrelated To Child Rape”</em> — The Onion (10,746 likes)</p>
</blockquote>

<p><strong>3. Social Issues &amp; Justice</strong> — Posts about systemic inequality, civil rights, immigration enforcement, discrimination, or other social conditions. Focus is observational or normative rather than electoral.</p>

<blockquote>
  <p><em>“The owners of a Dallas County warehouse that ICE had planned to use as a mega detention center said Monday it will not sell or lease the property to the federal government. ‘God answered our prayers,’ the Hutchins Mayor said.”</em> — Rachel Maddow (21,132 likes)</p>
</blockquote>

<p><strong>4. News &amp; Current Events</strong> — Posts reporting on, linking to, or discussing recent news stories across any domain. Includes breaking news, investigative stories, and news aggregation.</p>

<blockquote>
  <p><em>“An NPR investigation finds the public database of Epstein files is missing dozens of pages related to sexual abuse accusations against President Trump.”</em> — NPR (10,185 likes)</p>
</blockquote>

<p><strong>5. Entertainment &amp; Pop Culture</strong> — Posts about film, television, music, celebrity, sports, books, or cultural moments. Includes personal fandom and cultural commentary.</p>

<blockquote>
  <p><em>“Bluesky is collegial and interesting, the way Twitter used to be. Bonus: most people can spell.”</em> — Stephen King (80,664 likes)</p>
</blockquote>

<p><strong>6. Humor &amp; Satire</strong> — Posts that are primarily comedic in intent: jokes, satirical takes, absurdist commentary, or ironic framings of current events.</p>

<blockquote>
  <p><em>“Netanyahu Calls Iran Strikes Necessary To Prevent War He Just Started”</em> — The Onion (23,571 likes)</p>
</blockquote>

<p><strong>7. Science &amp; Technology</strong> — Posts about scientific findings, technology developments, AI, climate science, medicine, or the intersection of tech and society.</p>

<p><strong>8. Economy &amp; Business</strong> — Posts about financial markets, economic conditions, corporate news, consumer prices, trade policy, or business developments.</p>

<blockquote>
  <p><em>“Mr. Cuban — I just wanted to quickly thank you. My husband has [cancer]. We went to pick up his medication and were informed it was $29,000. We were able to get it from CostPlus for $99.”</em> — Mark Cuban (12,276 likes)</p>
</blockquote>

<p><strong>9. Personal &amp; Lifestyle</strong> — Posts that are personal in nature: life updates, reflections, expressions of mood, personal milestones, or non-political opinion.</p>

<blockquote>
  <p><em>“We were married on this day in 1978… soulmates ever since.”</em> — Mark Hamill (53,889 likes)</p>
</blockquote>

<hr />

<h2 id="what-blueskys-top-accounts-post-about">What Bluesky’s Top Accounts Post About</h2>

<p><img src="/images/bluesky-category-distribution.png" alt="" /></p>

<p>The overall landscape has a clear hierarchy. <strong>News &amp; Current Events</strong> dominates at 58.7% of all 2,500 posts — nearly three in five posts link to or discuss a recent story. <strong>Politics &amp; Elections</strong> comes in second at 50.0%, meaning half of all posts across these accounts touch on politics in some form. Then there’s a significant drop to <strong>Social Issues &amp; Justice</strong> (26.3%), <strong>Entertainment &amp; Pop Culture</strong> (23.0%), and <strong>Humor &amp; Satire</strong> (18.0%). <strong>Trump &amp; MAGA Criticism</strong> sits at 16.2%. At the bottom: <strong>Economy &amp; Business</strong> (13.2%), <strong>Personal &amp; Lifestyle</strong> (8.1%), and <strong>Science &amp; Technology</strong> (5.8%).</p>

<p>The concentration at the top isn’t surprising for this particular set of accounts — these aren’t lifestyle influencers or tech reviewers. But the degree of dominance by News and Politics is still striking. More than half of what this slice of Bluesky produces is essentially political journalism or political commentary.</p>

<p><img src="/images/bluesky-category-by-account.png" alt="" /></p>

<p>The account-level breakdown is where things get interesting.</p>

<p><strong>NPR</strong> is the most single-mindedly focused: 93% of its posts fall under News &amp; Current Events, and 60% under Politics &amp; Elections. Almost nothing else. The Humor &amp; Satire bar is essentially invisible (1%). NPR’s Bluesky presence is a straight news wire.</p>

<p><strong>The Onion</strong> is the mirror image in the best way. 81% of Onion posts are tagged Humor &amp; Satire — the highest satire rate of any account — but also 60% Entertainment &amp; Pop Culture. What’s slightly surprising is that only 21% of The Onion’s posts are tagged Politics &amp; Elections directly, and only 9% Trump &amp; MAGA Criticism. Satirical headlines about Trump do get tagged politics when the framing is explicitly electoral, but a lot of Onion content uses political <em>subjects</em> in the service of pure absurdism, which the model correctly separates out.</p>

<p><strong>AOC</strong> leads in Politics &amp; Elections (68% of her posts), Social Issues &amp; Justice (36%), and News &amp; Current Events (55%). Almost no Humor, very little Economy. Her Bluesky presence reads like exactly what it is: a member of Congress doing political communication full-time.</p>

<p><strong>Mark Cuban</strong> is the biggest outlier in the dataset. He’s the only account where Economy &amp; Business is the dominant theme at 58% — more than any other category. He’s also one of the few accounts where Trump &amp; MAGA Criticism is near zero (under 1%). Cuban posts about healthcare costs, business models, tariff economics, and policy mechanics. His account has almost nothing in common with the others topically.</p>

<p><strong>Stephen King</strong> is the most balanced. He mixes Entertainment &amp; Pop Culture (62%), Politics &amp; Elections (27%), Humor &amp; Satire (23%), and News &amp; Current Events (24%) roughly evenly. His account feels like a person who actually posts about many things, not a media organization running a content strategy.</p>

<p><strong>Mark Hamill</strong> tilts heavily toward Entertainment &amp; Pop Culture (Star Wars, his marriage, personal reflections) alongside Politics — a mix that reflects his public persona as both a cultural figure and a vocal political voice.</p>

<hr />

<h2 id="who-gets-the-most-engagement">Who Gets the Most Engagement?</h2>

<p><img src="/images/bluesky-likes-by-account.png" alt="" /></p>

<p>There is not a close race here. <strong>AOC averages 34,674 likes per post</strong> — more than twice the second-place Mark Hamill at 12,049. After that: Stephen King (5,039), MeidasTouch (4,226), Rachel Maddow (3,744), The Onion (1,598), George Takei (1,016), NPR (303), Mark Cuban (139), and the NYT (91).</p>

<p>The top 10 most-liked posts in the dataset are all AOC’s, with the most-liked reaching 167,305 likes on a post about a protest in Tucson: <em>“Original projected attendance was 3,000 people. 23,000 showed up.”</em></p>

<p>The NPR and NYT numbers are particularly striking — both are major media organizations with massive follower counts, yet they average roughly 300 and under 100 likes per post respectively. The media organizations are generating engagement on Bluesky at a fraction of the rate of individual personalities. Whether that reflects the algorithm, audience behavior, content style, or all three is hard to disentangle from this data alone.</p>

<p><img src="/images/bluesky-engagement-by-category.png" alt="" /></p>

<p>By category, the engagement hierarchy tells a clean story. <strong>Social Issues &amp; Justice</strong> leads at 8,819 average likes per post, followed by <strong>Politics &amp; Elections</strong> at 7,795 and <strong>Trump &amp; MAGA Criticism</strong> at 6,849. <strong>Economy &amp; Business</strong> posts average just 2,790 likes; <strong>Science &amp; Technology</strong> posts average only 706 — by far the lowest of any category.</p>

<p>But engagement averages are heavily influenced by <em>who</em> posts in each category. AOC posts almost entirely in Politics and Social Issues; Mark Cuban posts almost entirely in Economy. These raw category averages are partly measuring account popularity, not content appeal in isolation.</p>

<hr />

<h2 id="what-predicts-likes">What Predicts Likes?</h2>

<p>To separate content from creator, I ran two regression models with <code class="language-plaintext highlighter-rouge">log(likes + 1)</code> as the outcome — a log transformation that handles the extreme right skew in likes (the median is 919, the max is 167,305).</p>

<p>The first model includes only the eight category indicators, with <strong>Personal &amp; Lifestyle</strong> as the omitted reference category. The second adds account fixed effects, which isolates the content effect from the very large differences in per-account audience.</p>
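
<p>For reference, both specifications are ordinary least squares on the log outcome. A sketch with statsmodels, using the column names from the published CSV (<code class="language-plaintext highlighter-rouge">category_1</code> through <code class="language-plaintext highlighter-rouge">category_9</code>, <code class="language-plaintext highlighter-rouge">likes</code>, <code class="language-plaintext highlighter-rouge">account_name</code>); it assumes the category columns are 0/1 and that <code class="language-plaintext highlighter-rouge">category_9</code> is the Personal &amp; Lifestyle indicator, and it will not necessarily reproduce the exact coefficients below.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bluesky_classified.csv")
df["log_likes"] = np.log1p(df["likes"])

# Model 1: content categories only. category_1 ... category_8 are the binary indicators;
# category_9 (assumed to be Personal &amp; Lifestyle) is left out as the reference.
cats = " + ".join(f"category_{i}" for i in range(1, 9))
m1 = smf.ols(f"log_likes ~ {cats}", data=df).fit()

# Model 2: add account fixed effects to separate content effects from creator effects
m2 = smf.ols(f"log_likes ~ {cats} + C(account_name)", data=df).fit()

print(m1.rsquared, m2.rsquared)
</code></pre></div></div>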

<p><img src="/images/bluesky-regression-likes.png" alt="" /></p>

<p>Before controlling for account, seven of eight categories are statistically significant predictors of likes relative to Personal &amp; Lifestyle posts. The strongest positive predictors:</p>

<ul>
  <li><strong>Entertainment &amp; Pop Culture</strong>: +1.04 log-units (the largest positive coefficient)</li>
  <li><strong>Politics &amp; Elections</strong>: +1.01</li>
  <li><strong>Trump &amp; MAGA Criticism</strong>: +0.99</li>
  <li><strong>Humor &amp; Satire</strong>: +0.77</li>
  <li><strong>Social Issues &amp; Justice</strong>: +0.60</li>
  <li><strong>News &amp; Current Events</strong>: +0.29</li>
</ul>

<p>The two negative predictors are striking: <strong>Science &amp; Technology</strong> (−0.47) and especially <strong>Economy &amp; Business</strong> (−1.96), the largest coefficient in the model by magnitude and strongly negative. An economics post generates dramatically fewer likes than a comparable personal post in this raw model. The model explains 18.1% of the variance in log-likes.</p>

<p>But much of this is just reflecting AOC’s enormous engagement. AOC rarely posts about the Economy; Cuban almost exclusively does, and Cuban averages 139 likes per post. Once we account for who’s posting, the picture shifts considerably.</p>

<p><img src="/images/bluesky-regression-likes-adj.png" alt="" /></p>

<p>With account fixed effects added, R² jumps from 18.1% to 61.9% — the vast majority of the variance in likes is explained by <em>who</em> posted, not <em>what</em> they posted about. But the content effects don’t disappear; they just reorganize.</p>

<p><strong>Politics &amp; Elections</strong> (0.74), <strong>News &amp; Current Events</strong> (0.65), <strong>Social Issues &amp; Justice</strong> (0.57), <strong>Entertainment &amp; Pop Culture</strong> (0.39), <strong>Humor &amp; Satire</strong> (0.38), and <strong>Trump &amp; MAGA Criticism</strong> (0.30) all remain positive and significant even after controlling for account. Within any given account, these content types generate more likes than a baseline personal post.</p>

<p>Most striking: <strong>Economy &amp; Business</strong> goes from −1.96 to −0.002 (effectively zero, p = 0.99). The raw negative effect was entirely due to compositional differences between accounts — Cuban is the economy poster, and Cuban gets very few likes. Once you control for the account posting it, economics content performs no differently than personal content. The category itself isn’t the problem; the platform just happens to have its economics-focused account underperform on likes.</p>

<p><strong>Science &amp; Technology</strong> also loses its negative coefficient entirely once account is controlled (it becomes +0.25, non-significant). Science content doesn’t hurt engagement; it just happens to be posted by accounts that underperform on likes overall.</p>

<hr />

<h2 id="does-timing-matter">Does Timing Matter?</h2>

<p><img src="/images/bluesky-likes-by-weekday.png" alt="" /></p>

<p>The weekly pattern is clear. <strong>Friday is the best day to post</strong> at 8,522 average likes, followed by Tuesday (6,648), Wednesday (6,263), and Saturday (5,384). <strong>Sunday is the worst day</strong> at 3,325 average likes — about 60% lower than Friday’s average.</p>
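
<p>The weekday averages are a plain groupby on the post timestamp; a minimal sketch, assuming <code class="language-plaintext highlighter-rouge">timestamp</code> parses cleanly with pandas:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

df = pd.read_csv("bluesky_classified.csv")
df["weekday"] = pd.to_datetime(df["timestamp"]).dt.day_name()

# Average likes per post by day of week
print(df.groupby("weekday")["likes"].mean().sort_values(ascending=False).round(0))
</code></pre></div></div>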

<p>The spread across days is meaningful but not enormous compared to the account-level differences. Moving from Sunday to Friday would roughly double your average likes in this dataset. Moving from the NYT to AOC would multiply them by about 380.</p>

<p><img src="/images/bluesky-engagement-scatter.png" alt="" /></p>

<p>The scatter plot shows every original post (log scale) colored by account. The AOC cluster sits visibly higher and to the right than every other account — more likes <em>and</em> more replies. The media organizations (NYT, NPR) cluster near the bottom. Most posts across all accounts sit in the low-engagement zone regardless of content type; the viral outliers are rare and concentrated in a small number of accounts.</p>

<hr />

<h2 id="the-public-dataset">The Public Dataset</h2>

<p>The full classified dataset — 2,500 posts, 10 accounts, 9 binary category columns, engagement metrics — is available at <strong><a href="https://huggingface.co/datasets/chrissoria/bluesky-top10-classified">chrissoria/bluesky-top10-classified</a></strong> on Hugging Face.</p>

<p>Each row is a post with columns for: <code class="language-plaintext highlighter-rouge">account_name</code>, <code class="language-plaintext highlighter-rouge">account_handle</code>, <code class="language-plaintext highlighter-rouge">timestamp</code>, <code class="language-plaintext highlighter-rouge">social_media_input</code> (the post text), <code class="language-plaintext highlighter-rouge">likes</code>, <code class="language-plaintext highlighter-rouge">replies</code>, <code class="language-plaintext highlighter-rouge">reposts</code>, <code class="language-plaintext highlighter-rouge">is_repost</code>, <code class="language-plaintext highlighter-rouge">category_1</code> through <code class="language-plaintext highlighter-rouge">category_9</code>, and several derived fields including <code class="language-plaintext highlighter-rouge">post_length</code>, <code class="language-plaintext highlighter-rouge">contains_url</code>, and <code class="language-plaintext highlighter-rouge">contains_image</code>. All classification was done by GPT-4o-mini in a single-pass, no-ensemble run. For research applications requiring higher accuracy, a multi-model ensemble is straightforward to add.</p>
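
<p>Loading the published data is a one-liner with the <code class="language-plaintext highlighter-rouge">datasets</code> library, and the per-account rates above fall out of a groupby. (The <code class="language-plaintext highlighter-rouge">train</code> split name is the usual default for a single-file dataset, and the sketch assumes the category columns are 0/1.)</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datasets import load_dataset

# Pull the classified posts straight from HuggingFace
df = load_dataset("chrissoria/bluesky-top10-classified", split="train").to_pandas()

# Per-account share of posts tagged Politics &amp; Elections (category_1 in the scheme above)
print(df.groupby("account_name")["category_1"].mean().sort_values(ascending=False).round(2))
</code></pre></div></div>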

<p>A few things you could do with this dataset that I haven’t:</p>
<ul>
  <li><strong>Sentiment analysis</strong> by account — which accounts skew optimistic vs. cynical?</li>
  <li><strong>Topic evolution over time</strong> — has the distribution shifted as the news cycle changed?</li>
  <li><strong>Reply rate vs. like rate</strong> — some content generates conversation without generating likes; which category does that?</li>
  <li><strong>Cross-platform comparison</strong> — the same 10 accounts presumably post on other platforms; how does Bluesky compare to X or Instagram in terms of what they share?</li>
</ul>

<hr />

<h2 id="how-to-run-it-yourself">How to Run It Yourself</h2>

<p>cat-vader handles the data pull and classification in a single call. Here’s the full workflow for replicating this analysis on any Bluesky account:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catvader</span> <span class="k">as</span> <span class="n">cv</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">cv</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">sm_source</span><span class="o">=</span><span class="s">"bluesky"</span><span class="p">,</span>
    <span class="n">handle</span><span class="o">=</span><span class="s">"aoc.bsky.social"</span><span class="p">,</span>   <span class="c1"># any Bluesky handle
</span>    <span class="n">sm_posts</span><span class="o">=</span><span class="mi">250</span><span class="p">,</span>                <span class="c1"># number of recent posts to fetch
</span>    <span class="n">categories</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Politics &amp; Elections"</span><span class="p">,</span>
        <span class="s">"Trump &amp; MAGA Criticism"</span><span class="p">,</span>
        <span class="s">"Social Issues &amp; Justice"</span><span class="p">,</span>
        <span class="s">"News &amp; Current Events"</span><span class="p">,</span>
        <span class="s">"Entertainment &amp; Pop Culture"</span><span class="p">,</span>
        <span class="s">"Humor &amp; Satire"</span><span class="p">,</span>
        <span class="s">"Science &amp; Technology"</span><span class="p">,</span>
        <span class="s">"Economy &amp; Business"</span><span class="p">,</span>
        <span class="s">"Personal &amp; Lifestyle"</span><span class="p">,</span>
    <span class="p">],</span>
    <span class="n">description</span><span class="o">=</span><span class="s">"Social media posts from a public Bluesky account"</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-openai-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"gpt-4o-mini"</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">results</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"bluesky_classified.csv"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p>The output is a DataFrame with one row per post and binary columns for each category. From there, any standard analysis pipeline applies — R, Python, whatever you prefer.</p>

<p>One difference from the Threads workflow: Bluesky doesn’t require OAuth or a developer account. The public API is unauthenticated for reading public posts. You just need an OpenAI (or other provider) key for the classification step.</p>

<hr />

<p>The consistent finding across both the Threads analysis and this one is that <em>who</em> is posting matters far more than <em>what</em> they post. Account identity — follower count, posting frequency, platform reputation — explains more of the variance in engagement than any content category. But content is not irrelevant: within any given account, political and social posts consistently outperform baseline personal posts, and that effect survives account controls.</p>

<p>For Bluesky specifically, the platform’s current character as a haven for politically engaged progressives shows up plainly in the data. Even The Onion — a satirical outlet that theoretically covers everything — lands 81% of its posts in Humor &amp; Satire while still earning engagement that tracks the political intensity of the moment. The platform has a distinct topical gravity, and the accounts doing best are the ones whose content aligns with it.</p>

<p>If you build something interesting with the dataset or the cat-vader pipeline, reach out at <a href="mailto:chrissoria@berkeley.edu">chrissoria@berkeley.edu</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="LLM" /><category term="social media" /><category term="bluesky" /><category term="cat-vader" /><category term="NLP" /><category term="open source" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Classifying Open-Ended Survey Responses Directly in Claude Code</title><link href="https://christophersoria.com/posts/2026/03/catllm-claude-code/" rel="alternate" type="text/html" title="Classifying Open-Ended Survey Responses Directly in Claude Code" /><published>2026-03-04T00:00:00-08:00</published><updated>2026-03-04T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/03/catllm-claude-code</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/catllm-claude-code/"><![CDATA[<p><img src="/images/catllm-claude-code-banner.png" alt="" /></p>

<p>My open-source package <strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong> has always been a Python-first tool: you install it, import it, write a script, run it. That works fine when you already have a pipeline set up. It’s friction when you just want to quickly classify a CSV someone sent you and move on.</p>

<p>Claude Code — Anthropic’s terminal-based coding agent — supports project-local slash commands: markdown files in <code class="language-plaintext highlighter-rouge">.claude/commands/</code> that inject a prompt and tool permissions when invoked. I added four of them to cat-llm. The result is that you can now classify survey data, extract categories, check your API keys, and run end-to-end tests without touching a Python file.</p>

<p>This post walks through the setup and shows what it looks like in practice using 40 rows of open-ended responses from the UCNets survey (variable <code class="language-plaintext highlighter-rouge">a19i</code>).</p>

<hr />

<h2 id="installation">Installation</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cat-llm
</code></pre></div></div>

<p>That’s the only dependency. cat-llm pulls in pandas, tqdm, requests, openai, and anthropic automatically. For PDF classification support:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cat-llm[pdf]
</code></pre></div></div>

<p>You’ll also need at least one provider API key. cat-llm reads from environment variables automatically:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">OPENAI_API_KEY</span><span class="o">=</span><span class="s2">"sk-..."</span>       <span class="c"># OpenAI / xAI</span>
<span class="nb">export </span><span class="nv">ANTHROPIC_API_KEY</span><span class="o">=</span><span class="s2">"sk-ant-..."</span>  <span class="c"># Anthropic</span>
<span class="nb">export </span><span class="nv">GOOGLE_API_KEY</span><span class="o">=</span><span class="s2">"AIza..."</span>      <span class="c"># Google</span>
</code></pre></div></div>

<p>Or drop them in a <code class="language-plaintext highlighter-rouge">.env</code> file at your project root — cat-llm will pick them up.</p>

<hr />

<h2 id="adding-the-claude-code-commands">Adding the Claude Code Commands</h2>

<p>Clone or navigate to your cat-llm working directory, then create the commands folder:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> <span class="nt">-p</span> .claude/commands/catllm
</code></pre></div></div>

<p>Add four markdown files — <code class="language-plaintext highlighter-rouge">classify.md</code>, <code class="language-plaintext highlighter-rouge">extract.md</code>, <code class="language-plaintext highlighter-rouge">providers.md</code>, and <code class="language-plaintext highlighter-rouge">test.md</code> — each containing a prompt that describes what Claude should do when the command is invoked. The full command definitions are in the <a href="https://github.com/chrissoria/cat-llm">cat-llm repository</a> under <code class="language-plaintext highlighter-rouge">.claude/commands/catllm/</code>.</p>

<p>Once the files exist, open (or reopen) a Claude Code session from the cat-llm project directory. Type <code class="language-plaintext highlighter-rouge">/catllm:</code> and tab-complete to see all four commands available.</p>

<hr />

<h2 id="checking-your-providers">Checking Your Providers</h2>

<p>Before classifying anything, run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/catllm:providers
</code></pre></div></div>

<p>Claude detects which API keys are present in your environment, masks the values, and lists suggested model names for each configured provider:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== cat-llm Provider Status ===

[OK] OpenAI / xAI
     Key: OPENAI_API_KEY = sk-p...k3Rw
     Models: gpt-5, gpt-4o-mini, grok-3

[OK] Anthropic
     Key: ANTHROPIC_API_KEY = sk-a...9xQz
     Models: claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-20251001

[ ] Google (GOOGLE_API_KEY not set)
[ ] Mistral (MISTRAL_API_KEY not set)
[ ] HuggingFace (HUGGINGFACE_API_TOKEN not set)

[OK] Ollama (local)
     llama3.2:latest
     mistral:latest

Configured: 2 provider(s)
</code></pre></div></div>

<hr />

<h2 id="classifying-ucnets-a19i">Classifying UCNets <code class="language-plaintext highlighter-rouge">a19i</code></h2>

<p>The UCNets survey includes an open-ended question (<code class="language-plaintext highlighter-rouge">a19i</code>) asking respondents about sources of social support. The raw responses are text — exactly what cat-llm is designed for.</p>

<p>I have 40 responses in a CSV. The first step is letting the data suggest its own categories rather than imposing a scheme from the outside. The <code class="language-plaintext highlighter-rouge">/catllm:extract</code> command runs cat-llm’s <code class="language-plaintext highlighter-rouge">extract()</code> function, which samples random chunks of the responses, asks the model to name the recurring themes in each chunk, and returns a ranked frequency list:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/catllm:extract ucnets_a19i.csv --col response --model gpt-5
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== Extracting categories from ucnets_a19i.csv ===
Column: response | Rows: 40 | Model: gpt-5

Discovered categories (ranked by frequency):
  1. Family and close relationships       (found in 18/20 chunks)
  2. Friends and peer support             (found in 15/20 chunks)
  3. Professional or institutional help   (found in 12/20 chunks)
  4. Online communities                   (found in 9/20 chunks)
  5. Religious or spiritual community     (found in 8/20 chunks)
  6. No support or social isolation       (found in 7/20 chunks)
  7. Coworkers or colleagues              (found in 5/20 chunks)

Suggested follow-up:
  /catllm:classify ucnets_a19i.csv \
    --col response \
    --categories "Family and close relationships,Friends and peer support,Professional or institutional help,Online communities,Religious or spiritual community,No support or social isolation,Coworkers or colleagues" \
    --output ucnets_a19i_classified.csv
</code></pre></div></div>

<p>The command even generates the follow-up classify call. Running it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/catllm:classify ucnets_a19i.csv \
  --col response \
  --categories "Family and close relationships,Friends and peer support,Professional or institutional help,Online communities,Religious or spiritual community,No support or social isolation,Coworkers or colleagues" \
  --model gpt-5 \
  --output ucnets_a19i_classified.csv
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== Classifying ucnets_a19i.csv ===
Column: response | Rows: 40 | Model: gpt-5
Categories: 7

   response                                          family  friends  professional  online  religious  isolated  coworkers
0  My sister and I talk every day, she really...        1       0          0          0         0          0         0
1  Mostly just online — Reddit communities wh...        0       0          0          1         0          0         0
2  My church group has been incredible, they ...        0       1          0          0         1          0         0
3  Honestly nobody. I've felt very alone thr...         0       0          0          0         0          1         0
4  My therapist and my wife, in that order...          1       0          1          0         0          0         0
...

--- Category Distribution (40 rows) ---
Family and close relationships      29  (72.5%)
Friends and peer support            18  (45.0%)
Professional or institutional help  11  (27.5%)
Religious or spiritual community     9  (22.5%)
Online communities                   8  (20.0%)
No support or social isolation       6  (15.0%)
Coworkers or colleagues              4  (10.0%)

Saved to ucnets_a19i_classified.csv
</code></pre></div></div>

<p>The output is a binary matrix — one column per category, one row per response. A single response can belong to multiple categories simultaneously, which is the right design for this kind of data: someone who mentions both their sister and their therapist gets a 1 in both <code class="language-plaintext highlighter-rouge">family</code> and <code class="language-plaintext highlighter-rouge">professional</code>, not a forced choice between them.</p>

<p>The classified CSV is ready for any downstream analysis — R, Stata, Python, whatever the pipeline requires.</p>
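
<p>If you want to sanity-check the matrix before handing it off, a few lines of pandas are enough. A minimal sketch, assuming the saved CSV uses the short column names shown in the preview above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Load the classified output and inspect category prevalence and co-occurrence.
# Column names are assumed to match the short labels in the preview above.
coded = pd.read_csv("ucnets_a19i_classified.csv")
category_cols = ["family", "friends", "professional", "online",
                 "religious", "isolated", "coworkers"]

# Share of responses tagged with each category
print(coded[category_cols].mean().sort_values(ascending=False))

# How often pairs of categories co-occur within the same response
print(coded[category_cols].T.dot(coded[category_cols]))
</code></pre></div></div>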

<hr />

<h2 id="running-the-quick-test">Running the Quick Test</h2>

<p>There’s also a built-in smoke test command that runs on the package’s bundled example data:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/catllm:test
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== cat-llm Quick Test ===
File: examples/test_data/survey_responses.csv
Model: gpt-5
Categories: ['Positive', 'Negative', 'Neutral']

Loaded 20 rows. Columns: ['id', 'response']
Text column: response

--- Results ---
    response                                          Positive  Negative  Neutral
0   The program was incredibly helpful and I...           1        0        0
1   I didn't find the sessions useful at all...           0        1        0
2   It was okay, nothing special but not bad...           0        0        1
...

--- Category Distribution ---
Positive    11
Negative     5
Neutral      4

PASS: classify() completed successfully.
</code></pre></div></div>

<p>Useful for verifying a new API key or model name works before pointing it at real data.</p>

<hr />

<h2 id="what-else-you-can-do">What Else You Can Do</h2>

<p>The commands are thin wrappers around cat-llm’s Python API, which has considerably more surface area:</p>

<p><strong>Multi-model ensemble.</strong> Pass a list of models and cat-llm runs all of them in parallel, then reports a consensus column alongside each model’s individual output. Disagreements across models are flagged automatically.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">input_data</span><span class="p">,</span>
    <span class="n">categories</span><span class="p">,</span>
    <span class="n">models</span><span class="o">=</span><span class="p">[</span>
        <span class="p">(</span><span class="s">"gpt-5"</span><span class="p">,</span>                <span class="s">"openai"</span><span class="p">,</span>    <span class="n">openai_key</span><span class="p">,</span>    <span class="p">{}),</span>
        <span class="p">(</span><span class="s">"claude-sonnet-4-6"</span><span class="p">,</span>    <span class="s">"anthropic"</span><span class="p">,</span> <span class="n">anthropic_key</span><span class="p">,</span> <span class="p">{}),</span>
        <span class="p">(</span><span class="s">"gemini-2.0-flash"</span><span class="p">,</span>     <span class="s">"google"</span><span class="p">,</span>    <span class="n">google_key</span><span class="p">,</span>    <span class="p">{}),</span>
    <span class="p">],</span>
    <span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"unanimous"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Extended reasoning.</strong> Pass <code class="language-plaintext highlighter-rouge">thinking_budget=4096</code> (Anthropic) or <code class="language-plaintext highlighter-rouge">chain_of_thought=True</code> for borderline classification tasks where you want the model to reason before committing to a label.</p>
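
<p>As a rough sketch of what that looks like in a direct call (the parameter placement is an assumption based on the description above; the model tuple format mirrors the ensemble example):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm as cat

# Sketch: extended reasoning on a borderline task. thinking_budget applies to
# Anthropic models; chain_of_thought=True is the provider-agnostic alternative.
result = cat.classify(
    input_data,
    categories,
    models=[("claude-sonnet-4-6", "anthropic", anthropic_key, {})],
    thinking_budget=4096,  # token budget the model can spend reasoning first
)
</code></pre></div></div>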

<p><strong>PDF classification.</strong> If <code class="language-plaintext highlighter-rouge">input_data</code> points at a directory of PDFs, cat-llm renders each page and classifies the content as images. Useful for document archives or grant applications.</p>

<p><strong>Category discovery.</strong> <code class="language-plaintext highlighter-rouge">cat.extract()</code> is the same function the <code class="language-plaintext highlighter-rouge">/catllm:extract</code> command calls under the hood. You can run it directly, tune the <code class="language-plaintext highlighter-rouge">iterations</code> and <code class="language-plaintext highlighter-rouge">divisions</code> parameters, and use the raw frequency table to build a codebook before any classification happens.</p>
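
<p>A sketch of calling it directly. The <code class="language-plaintext highlighter-rouge">iterations</code> and <code class="language-plaintext highlighter-rouge">divisions</code> parameters are the ones named above, while the rest of the call reuses conventions from the ensemble example and should be treated as illustrative rather than the exact signature:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm as cat

# Sketch: category discovery before classification. More iterations and more
# divisions mean more sampling passes over smaller chunks of the data.
themes = cat.extract(
    input_data,
    iterations=6,
    divisions=15,
    models=[("gpt-5", "openai", openai_key, {})],
)
print(themes)  # raw frequency table you can turn into a codebook
</code></pre></div></div>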

<hr />

<p>The cat-llm repository is at <strong><a href="https://github.com/chrissoria/cat-llm">github.com/chrissoria/cat-llm</a></strong> and the package is on PyPI as <code class="language-plaintext highlighter-rouge">cat-llm</code> (version 2.5.1 as of this writing). If you adapt the Claude Code commands for a different workflow or build something interesting with the package, reach out at <a href="mailto:chrissoria@berkeley.edu">chrissoria@berkeley.edu</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="LLM" /><category term="cat-llm" /><category term="Claude Code" /><category term="open source" /><category term="NLP" /><category term="survey research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Analyzing My Threads Feed with cat-vader: LLM-Powered Social Media Classification at Scale</title><link href="https://christophersoria.com/posts/2026/03/catvader-threads-analysis/" rel="alternate" type="text/html" title="Analyzing My Threads Feed with cat-vader: LLM-Powered Social Media Classification at Scale" /><published>2026-03-02T00:00:00-08:00</published><updated>2026-03-02T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/03/catvader-threads-analysis</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/catvader-threads-analysis/"><![CDATA[<p><img src="/images/catvader-banner.png" alt="" /></p>

<audio controls="" style="width:100%">
  <source src="https://huggingface.co/datasets/chrissoria/blog-audio/resolve/main/catvader-threads-analysis.mp3" type="audio/mpeg" />
</audio>
<!-- Audio generated with edge_tts (en-US-BrianNeural) via convert_blog_to_audio.py in the repo root.
     To regenerate: cd chrissoria.github.io && python3 convert_blog_to_audio.py
     Hosted on HuggingFace: https://huggingface.co/datasets/chrissoria/blog-audio -->

<p>I spend a lot of time on Threads. Over the past two and a half years I’ve posted nearly 900 times: opinions on politics, technology, research, culture, and whatever else caught my attention that day. But I’ve never sat down and actually looked at what I post about. What are my real preoccupations? What topics dominate my feed? Which posts actually get engagement?</p>

<p>This post is an attempt to answer those questions systematically, using an LLM-powered classification pipeline I built called <strong><a href="https://pypi.org/project/cat-vader/">cat-vader</a></strong> — a fork of my open-source survey classification package, <strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong>, adapted for social media data.</p>

<hr />

<h2 id="background-cat-llm-and-cat-vader">Background: cat-llm and cat-vader</h2>

<p><a href="https://github.com/chrissoria/cat-llm">cat-llm</a> is an open-source Python package I originally built for classifying open-ended survey responses at scale. You give it a list of text responses and a set of categories, and it uses large language models to assign each response to one or more categories, with support for multi-model ensembles, chain-of-thought reasoning, and automatic category discovery. It was designed for researchers who need to code thousands of survey responses without manually reading each one.</p>

<p>The core architecture turned out to be highly resilient to different kinds of text input. Survey responses and social media posts are structurally similar — short, opinionated, often ambiguous text that needs to be bucketed into meaningful categories. So I cloned cat-llm into <strong><a href="https://github.com/chrissoria/cat-vader">cat-vader</a></strong>, stripped out the survey-specific scaffolding, and built a pipeline that can classify any collection of social media posts — whether you’re working from a scraped dataset, a platform export, or a direct API pull. For convenience, cat-vader also wires directly to the Threads API to pull your personal post history with engagement metrics in one call.</p>

<p>The goal of this post is to walk through that pipeline end-to-end using my own Threads feed as the example dataset, and then use the results to take an honest look at what I’ve been posting about. The same workflow applies to any corpus of social media text.</p>

<hr />

<h2 id="what-do-i-actually-post-about">What Do I Actually Post About?</h2>

<p>Before classifying anything, I needed to decide what categories to use. I could have imposed them from the top down — just picked eight topics that felt right — but that risks missing something real in my data, or imposing categories that don’t actually fit how I write. Instead, I used cat-vader’s <code class="language-plaintext highlighter-rouge">explore()</code> function to let the data suggest its own themes first.</p>

<p><code class="language-plaintext highlighter-rouge">explore()</code> works by repeatedly sampling random chunks of posts, asking the LLM to extract the most common topics from each chunk, and collecting all the extracted labels across many passes. It doesn’t merge or deduplicate — it returns every raw label string from every chunk across every iteration. The idea is that categories which appear frequently and consistently are the ones that genuinely characterize the corpus, while one-off labels are noise.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catvader</span> <span class="k">as</span> <span class="n">cv</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"threads_year.csv"</span><span class="p">)</span>
<span class="n">texts</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">"text"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="nb">len</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">][</span><span class="s">"text"</span><span class="p">].</span><span class="n">tolist</span><span class="p">()</span>  <span class="c1"># 582 posts with text
</span>
<span class="n">raw</span> <span class="o">=</span> <span class="n">cv</span><span class="p">.</span><span class="n">explore</span><span class="p">(</span>
    <span class="n">input_data</span><span class="o">=</span><span class="n">texts</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-openai-api-key"</span><span class="p">,</span>
    <span class="n">description</span><span class="o">=</span><span class="s">"Social media posts about current events, politics, technology, culture, and personal opinions"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"gpt-4o"</span><span class="p">,</span>
    <span class="n">iterations</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span>
    <span class="n">divisions</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
    <span class="n">max_categories</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
    <span class="n">categories_per_chunk</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>This produced 720 raw category extractions across 6 iterations and 15 chunk divisions, yielding 229 unique label strings. I counted the frequency of each label and eyeballed the top results to identify which themes were genuinely dominant versus which were just slightly different phrasings of the same idea (e.g. “Economy”, “Economics”, “Economy and Business”, and “Economy and Finance” are all the same theme).</p>
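
<p>The counting step is a one-liner once you have the raw labels. A minimal sketch, assuming <code class="language-plaintext highlighter-rouge">explore()</code> hands back the flat list of label strings described above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

# Tally every raw label returned by explore() and print the most common ones
label_counts = Counter(raw)
print(f"{len(raw)} extractions, {len(label_counts)} unique labels")
for label, n in label_counts.most_common(20):
    print(f"{n:3d}  {label}")
</code></pre></div></div>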

<p>Here are the top categories by raw frequency:</p>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Times Found</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Politics</td>
      <td>58</td>
    </tr>
    <tr>
      <td>Technology</td>
      <td>52</td>
    </tr>
    <tr>
      <td>Social Issues</td>
      <td>39</td>
    </tr>
    <tr>
      <td>Personal Opinions</td>
      <td>27</td>
    </tr>
    <tr>
      <td>Technology and AI</td>
      <td>22</td>
    </tr>
    <tr>
      <td>Economics</td>
      <td>16</td>
    </tr>
    <tr>
      <td>Education</td>
      <td>16</td>
    </tr>
    <tr>
      <td>Health and Science</td>
      <td>16</td>
    </tr>
    <tr>
      <td>Culture</td>
      <td>16</td>
    </tr>
    <tr>
      <td>Economy</td>
      <td>13</td>
    </tr>
    <tr>
      <td>Personal Experiences</td>
      <td>12</td>
    </tr>
    <tr>
      <td>Culture and Society</td>
      <td>12</td>
    </tr>
    <tr>
      <td>Media and Communication</td>
      <td>10</td>
    </tr>
    <tr>
      <td>Personal Opinions and Experiences</td>
      <td>10</td>
    </tr>
    <tr>
      <td>Media and Entertainment</td>
      <td>9</td>
    </tr>
    <tr>
      <td>Economy and Business</td>
      <td>9</td>
    </tr>
    <tr>
      <td>Education and Academia</td>
      <td>9</td>
    </tr>
    <tr>
      <td>Science and Health</td>
      <td>9</td>
    </tr>
  </tbody>
</table>

<p>The signal is clear. Collapsing the variants down, eight themes dominate: <strong>Politics</strong>, <strong>Technology &amp; AI</strong>, <strong>Social Issues</strong>, <strong>Economics &amp; Finance</strong>, <strong>Health &amp; Science</strong>, <strong>Education &amp; Research</strong>, <strong>Culture &amp; Entertainment</strong>, and <strong>Personal</strong>. These became the starting point for the final category set.</p>

<h3 id="defining-my-categories">Defining My Categories</h3>

<p>Rather than using bare labels, I defined each category with a description and concrete examples. This follows best-practice category construction from my own empirical work on LLM classification: verbose categories with descriptions and examples significantly outperform bare labels, improving accuracy by reducing model ambiguity on borderline cases.</p>

<p>The <code class="language-plaintext highlighter-rouge">explore()</code> output pointed to the broad themes, but the final set draws heavily on my own domain knowledge of what I post about. I know I post a lot about politics, and I know that my political posts tend to fall into distinct registers — partisan frustration, specific policy arguments, and direct Trump commentary — that a generic “Politics” label would collapse together. I also know I post disproportionately about AI relative to most people, which warranted its own category rather than being folded into Technology. My final categories reflect both what my data showed and what I know about myself as a poster.</p>

<p><strong>1. Partisan Politics</strong> — Posts relating to partisanship directly or indirectly: references to political parties, political tribalism, electoral dynamics, or the behavior of politicians and political actors as representatives of a party or ideological bloc (e.g., “The Republican Party has moved too far right,” “Democrats keep losing working-class voters”).</p>

<blockquote>
  <p><em>“Either side of the political spectrum has little empathy for the other. They actively dislike each other. When an act of violence occurs, the first instinct is to ask which side did it.”</em></p>
</blockquote>

<p><strong>2. Policy Politics</strong> — Posts advocating for or critiquing specific policies or policy positions, independent of partisan framing: arguments about what government should or shouldn’t do, regulatory stances, or calls for systemic reform (e.g., “We need universal healthcare,” “Tech companies need stronger antitrust enforcement”).</p>

<blockquote>
  <p><em>“It should be illegal to create AI videos meant to mislead and misinform people about current events.”</em></p>
</blockquote>

<p><strong>3. Anti-Trump</strong> — Posts directly critiquing Donald Trump, his character, his decisions in office, his policies, or individuals and groups who support him or his agenda (e.g., “Trump’s tariffs are going to tank the economy,” “MAGA voters keep getting lied to”).</p>

<blockquote>
  <p><em>“Trump is doing a great job at driving global unity… in their opposition to the US as an international bully.”</em></p>
</blockquote>

<p><strong>4. Technology</strong> — Posts discussing technology in any form, broadly construed: software, hardware, consumer devices, platforms, the tech industry, or the societal implications of technological change (e.g., “Apple’s new chip is a generational leap,” “Social media is rewiring how we form opinions”).</p>

<blockquote>
  <p><em>“Anthropic is all the hype but OpenAI still has the best models sorry to say.”</em></p>
</blockquote>

<p><strong>5. Artificial Intelligence</strong> — Posts specifically about AI models, AI capabilities, or commentary on AI companies such as OpenAI, Anthropic, Google DeepMind, or xAI. This includes takes on specific models (GPT, Claude, Gemini, Grok), opinions on what AI can and cannot do, or the direction of the AI industry (e.g., “The AI hype cycle is showing cracks,” “LLMs are great at pattern matching but terrible at actual reasoning”).</p>

<blockquote>
  <p><em>“In my opinion, the only people who are saying LLMs will someday automate all jobs don’t really understand the technology.”</em></p>
</blockquote>

<p><strong>6. Social Issues</strong> — Posts about social conditions, inequality, discrimination, or systemic patterns in society, without explicitly advocating for a specific policy response or criticizing political leadership. The focus is observational or normative about society itself rather than prescriptive about what government should do (e.g., “The wealth gap between generations is unlike anything we’ve seen,” “Racism in hiring is still very much alive”).</p>

<blockquote>
  <p><em>“The most deflating thing about this whole thing is how two people will view the same video and come to entirely different conclusions.”</em></p>
</blockquote>

<p><strong>7. Shit Posting</strong> — Low-effort, irreverent, or deliberately provocative posts with no pretense of serious commentary. The tone is casual to the point of flippant, the take is blunt, and the goal is more to express a vibe than make an argument (e.g., “Astrology is BS,” “Nobody actually likes networking events”).</p>

<blockquote>
  <p><em>“Daily reminder that astrology is still BS.”</em></p>
</blockquote>

<p><strong>8. Economics &amp; Finance</strong> — Posts relating to economic conditions, financial markets, or specific market developments: references to stock prices, commodity prices, oil markets, interest rates, inflation, or broader signals about the state of the economy (e.g., “The stock market is pricing in a recession,” “Coffee prices are up 40% and nobody is talking about it”).</p>

<blockquote>
  <p><em>“The Strait of Hormuz, which handles roughly 20% of the world’s daily oil supply, is effectively shut down. That means lower supply, which means higher prices. When oil prices rise the price of all other commodities rise.”</em></p>
</blockquote>

<p><strong>9. Thirst Trap</strong> — Posts that are flirty, self-promotional, or designed to attract attention and engagement through charm or physical appeal (e.g., “Just got a haircut and feeling myself,” “Anyone else look good today or just me?”).</p>

<h3 id="my-category-breakdown">My Category Breakdown</h3>

<p>With my categories defined, I ran <code class="language-plaintext highlighter-rouge">classify()</code> on the full year of text posts — 582 posts with non-empty text content — using Llama 3.3 70B on SambaNova. Each post was classified against all categories independently, meaning a single post can and often does belong to more than one category. A post lamenting Trump’s tariff policy, for example, might be tagged as both Anti-Trump and Economics &amp; Finance. That’s by design: my categories aren’t mutually exclusive buckets, they’re lenses.</p>
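
<p>For reference, the shape of that call looks roughly like this. The category strings are abbreviated here (the real ones carry the full descriptions and examples listed above), and the model identifier and argument names are assumptions that mirror the <code class="language-plaintext highlighter-rouge">explore()</code> call earlier rather than a documented signature:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catvader as cv

# Sketch: classify the 582 text posts against the nine verbose categories.
# Category strings abbreviated; the model name is an assumed SambaNova identifier.
categories = [
    "Partisan Politics: posts about parties, tribalism, or electoral dynamics...",
    "Policy Politics: advocacy for or critique of specific policies...",
    "Anti-Trump: direct critique of Trump, his policies, or his supporters...",
    # ...the remaining six categories, each with a description and examples
]

classified = cv.classify(
    input_data=texts,                           # the 582 text posts from earlier
    categories=categories,
    user_model="Meta-Llama-3.3-70B-Instruct",   # assumed model identifier
    api_key="your-sambanova-api-key",
)
</code></pre></div></div>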

<p>The chart below shows the percentage of posts that were assigned to each category. Because categories overlap, the bars don’t sum to 100% — they can’t. What the chart is really showing is the <em>frequency</em> of each topic in my feed: how often, across 582 posts, did I reach for a given subject. Think of it less as a pie chart and more as a set of independent thermometers, each measuring how much of my posting energy went toward a given theme.</p>

<p><img src="/images/catvader-category-distribution.png" alt="" /></p>

<p>A note on scope: cat-vader can classify images directly, but for this analysis I focused on text only. My dataset includes image posts, but the model was given just the text caption — no image content. Posts without any text were excluded entirely, and image posts were classified solely on whatever caption was attached. That’s a real limitation, and one worth keeping in mind when interpreting any categories that might skew visual (more on that in a moment).</p>

<p>I’ll be honest: I wasn’t sure what I’d find. I post somewhat mindlessly — something catches my eye, I have a reaction, I type it out. I don’t sit down with a content strategy. So this is genuinely an exercise in holding up a mirror.</p>

<p>One early finding did give me pause: <strong>Shit Posting</strong> came in second overall at 27.7% of posts, just behind <strong>Social Issues</strong> at 32.5% and ahead of <strong>Technology</strong> at 25.9%. My first instinct was that something had gone wrong — a miscategorized label, a prompt that was too loose, something. I went back and spot-checked the flagged posts. Nope. Fully accurate. Apparently more than a quarter of what I put out into the world is, by any reasonable definition, a shit post. I have made peace with this, though I’ve also quietly vowed to post with a bit more intention going forward, with the goal of demoting Shit Posting from a top-three category to something more like fifth. For what it’s worth, my top four — Social Issues (32.5%), Shit Posting (27.7%), Technology (25.9%), and Partisan Politics (24.6%) — probably tell you everything you need to know about me as a person.</p>

<p>One other result worth flagging: <strong>Thirst Trap</strong> came in at exactly one post (0.2%). False positives happen, and this is a good example of why. The post in question was an image captioned simply <em>“Me”</em>, and since the model only had that single word to work with, tagging it as a thirst trap is a defensible inference. Whether it actually was one depends on the photo, which the model never saw. I’m not saying it wasn’t.</p>

<hr />

<h2 id="engagement-by-category">Engagement by Category</h2>

<p>One of the advantages of pulling data directly through the Threads API is that cat-vader returns not just post text but a full set of engagement metrics alongside it. For every post, the package outputs:</p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">timestamp</code></td>
      <td>Date and time of the post</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">media_type</code></td>
      <td>Post type (text, image, video, repost)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">text</code></td>
      <td>Post text content</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">image_url</code></td>
      <td>URL of attached image, if any</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">likes</code></td>
      <td>Number of likes</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">replies</code></td>
      <td>Number of replies/comments</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">reposts</code></td>
      <td>Number of reposts</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">quotes</code></td>
      <td>Number of quote posts</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">views</code></td>
      <td>Total post impressions</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">shares</code></td>
      <td>Number of shares</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">clicks</code></td>
      <td>Number of link clicks</td>
    </tr>
  </tbody>
</table>

<p>That means the classified dataset isn’t just a topic-coded text corpus — it’s a topic-coded text corpus with performance data attached. Which opens up an obvious question: does what I post about actually affect how much engagement I get?</p>

<p>The chart below shows average likes and average replies broken out by category. Because posts can belong to multiple categories, the same post may appear in more than one bar. The x-axis is sorted by average likes descending.</p>

<p><img src="/images/catvader-engagement-by-category.png" alt="" /></p>

<p>To see the full picture, the scatter plot below shows every post individually — x-axis is replies, y-axis is likes, log scale on both axes, colored by primary category. The log scale is doing a lot of work here. Without it, the chart is basically one dot in the top-right corner and 580 dots stacked on top of each other at zero. With it, you can actually see my distribution — which, in all honesty, is mostly a dense cloud of dots in the bottom-left. The majority of what I post sinks quietly into the void, liked by a handful of people who were probably just scrolling past and hit the button by accident. A few posts escape. Most do not. This is the reality of posting.</p>

<p><img src="/images/catvader-engagement-scatter.png" alt="" /></p>

<hr />

<h2 id="what-actually-predicts-views">What Actually Predicts Views?</h2>

<p>The bar charts above are descriptive — they show averages, but averages don’t control for anything. A category might look high-performing simply because I happen to post it more, or because it correlates with another category that’s doing the real work. To get a cleaner picture, I ran regression models that include all of the category indicators simultaneously, isolating the independent effect of each one. I ran a separate model for each outcome — views, likes, and replies.</p>
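
<p>Concretely, one way to set these models up is an OLS regression of a log-transformed outcome on the binary category columns, which is consistent with reporting effects as multiples. A sketch, with placeholder file and column names:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("threads_classified.csv")  # hypothetical path
category_cols = ["partisan_politics", "policy_politics", "anti_trump",
                 "technology", "artificial_intelligence", "social_issues",
                 "shit_posting", "economics_finance", "thirst_trap"]

# One model per outcome; shown here for views
X = sm.add_constant(df[category_cols].astype(float))
y = np.log1p(df["views"])

views_model = sm.OLS(y, X).fit()
print(views_model.summary())
# np.exp(coef) approximates the multiplicative effect of each category on views
</code></pre></div></div>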

<p>The chart below shows the coefficients from the views model, with 95% confidence intervals. Points in red are statistically significant (p &lt; 0.05); grey points are not.</p>

<p><img src="/images/catvader-regression-views.png" alt="" /></p>

<p>Three categories emerge as significant positive predictors of views <em>(for the stats nerds: R² = 6.8%, F p = 2.5×10⁻⁶)</em>:</p>

<ul>
  <li>
    <p><strong>Artificial Intelligence</strong> is the single strongest predictor. Holding everything else constant, an AI post gets roughly <strong>2.4x as many views</strong> as a comparable post on a different topic. AI has been the trendiest topic on the internet for the past two years, and the algorithm appears to reward it accordingly.</p>
  </li>
  <li>
    <p><strong>Partisan Politics</strong> comes in second. A partisan political post gets about <strong>2x as many views</strong> as a comparable non-partisan post. Tribal political content travels well on social media — this surprises no one.</p>
  </li>
  <li>
    <p><strong>Shit Posting</strong> rounds out the significant predictors, with posts in this category getting about <strong>1.8x as many views</strong>. Blunt, low-effort, instantly legible takes are apparently what the algorithm rewards. I have complicated feelings about this.</p>
  </li>
</ul>

<p>For likes, the story is similar but smaller in scale: partisan politics posts get about <strong>50% more likes</strong> than comparable posts, and shit posts about <strong>35% more</strong>. AI drops out entirely for likes — AI posts rack up views without proportionally converting to likes. Interesting.</p>

<p>For replies, no category made a meaningful difference. Nothing I post about reliably generates conversation, which is either a sign of epistemic humility or evidence that I am not as interesting as I think I am.</p>

<p>Topic explains only a small slice of overall engagement — most of what determines whether a post goes anywhere is timing, luck, and whether someone with a large following happens to engage. But the categories that do matter are consistent and interpretable.</p>

<p>One natural question is whether these effects are real or just a reflection of timing — maybe I happen to post AI content on Thursdays at peak hours, and it’s the timing doing the work rather than the topic. To check, I re-ran all models controlling for both day of week and hour of day.</p>

<p>The results held up for the most part. Partisan politics and AI posts still get roughly <strong>2x as many views</strong> after accounting for when they were posted — those effects appear to be about the content itself. Shit posting weakens once timing is controlled for, suggesting some of its raw advantage was coming from <em>when</em> I tend to fire off a shit post rather than the content. The reply non-result holds throughout.</p>

<p><img src="/images/catvader-regression-views-adj.png" alt="" /></p>

<p>One more alternative explanation worth ruling out: volume. Maybe on high-output days I’m simply flooding my feed and one post happens to catch a wave — meaning it’s the quantity, not the quality of the content, doing the work. The chart below plots each post’s views and likes against the number of posts I made that day.</p>

<p><img src="/images/catvader-freq-vs-engagement.png" alt="" /></p>

<p>The correlations are r = 0.16 for views and r = 0.17 for likes — small but consistently positive, suggesting a weak connection. Posting more on a given day does seem to nudge individual post performance slightly, though the effect is modest enough that it’s unlikely to be the main story. The content effects from the models hold.</p>
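
<p>The volume check itself is short. A sketch, assuming the timestamp, views, and likes columns from the cat-vader pull:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Correlate each post's engagement with how many posts went out the same day
df["date"] = pd.to_datetime(df["timestamp"]).dt.date
df["posts_that_day"] = df.groupby("date")["date"].transform("size")

print(df["posts_that_day"].corr(df["views"]))   # r = 0.16 in my data
print(df["posts_that_day"].corr(df["likes"]))   # r = 0.17 in my data
</code></pre></div></div>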

<hr />

<h2 id="bonus-does-the-day-of-the-week-matter">Bonus: Does the Day of the Week Matter?</h2>

<p>One last question that the timestamp data makes easy to answer: does <em>when</em> I post matter?</p>

<p><img src="/images/catvader-views-by-weekday.png" alt="" /></p>

<p>Thursday stands out immediately — average views of 3,241, more than double the next best day (Saturday at 1,439) and roughly eight times Wednesday’s average of 398. Monday and Friday are middling. Wednesday is the worst day to post, by a wide margin.</p>

<p>I don’t have a strong theory for why Thursday in particular. It might be something about my posting behavior on Thursdays — maybe I tend to post more shareable content, or post at better times within the day. It might also just be noise: with 72–106 posts per weekday, a handful of viral Thursday posts could skew the average significantly. Either way, the answer to “when should I post?” appears to be: not Wednesday.</p>
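
<p>Reproducing the weekday breakdown takes a few lines once the posts are in a DataFrame. A sketch, again assuming the column names from the cat-vader pull:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Average views and post count by day of week
df["weekday"] = pd.to_datetime(df["timestamp"]).dt.day_name()
weekday_views = (
    df.groupby("weekday")["views"]
      .agg(["mean", "count"])
      .sort_values("mean", ascending=False)
)
print(weekday_views)
</code></pre></div></div>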

<p>Time of day tells a cleaner story. The chart below breaks posts into five windows (Pacific Time):</p>

<p><img src="/images/catvader-views-by-timeofday.png" alt="" /></p>

<p>Late night posts (9pm–5am) average nearly 2,900 views — more than double the evening average of 1,126, and about twelve times the morning average of ~250. The pattern is monotonic: the later in the day, the more views. One plausible explanation is that late night is when I tend to post more impulsively about whatever’s dominating the news cycle, which also happens to be when more people are doom-scrolling. Another is that late night posts have more hours to accumulate views before I wake up and post something else that pushes them down. Either way, if I want views, apparently I should stay up later — which is not advice I needed the algorithm to give me.</p>

<hr />

<h2 id="conclusion-do-this-with-your-own-data">Conclusion: Do This With Your Own Data</h2>

<p>Everything in this post — pulling my data, discovering my categories, classifying 582 posts, and running the regressions — took a single afternoon. The data pull and classification itself ran in about 30 minutes; the rest was just analysis and writing. If you have a Threads account and a few API keys, you can run the same analysis on your own feed.</p>

<p>The most valuable step is <code class="language-plaintext highlighter-rouge">explore()</code> first. Don’t impose your categories from the top down. Run the exploration pass, look at what themes emerge with high frequency, and let your actual content tell you what it’s about. Your categories will be better for it, and you’ll probably learn something about yourself in the process that you wouldn’t have guessed going in.</p>

<p>From there, <code class="language-plaintext highlighter-rouge">classify()</code> gives you a labelled dataset you can take in any direction. A few starting points:</p>

<ul>
  <li><strong>Sentiment and tone.</strong> Instead of topic categories, define categories like “optimistic,” “cynical,” “ironic,” or “earnest.” You’ll get a mood profile of your posting history.</li>
  <li><strong>Audience targeting.</strong> If you post for multiple audiences — say, researchers, practitioners, and general readers — define categories for each and see how your mix has shifted over time.</li>
  <li><strong>Thread evolution.</strong> Classify posts by month and track how your topical distribution has changed. Are you posting more or less about AI than you were a year ago? The data will tell you.</li>
  <li><strong>Quote extraction.</strong> Use <code class="language-plaintext highlighter-rouge">extract()</code> instead of <code class="language-plaintext highlighter-rouge">classify()</code> to pull structured fields out of free text — named entities, specific claims, URLs, anything you want to turn into a column.</li>
</ul>

<p>Beyond social media, the underlying engine is <strong><a href="https://pypi.org/project/cat-llm/">cat-llm</a></strong>, which was designed for survey and qualitative data. If you’re a researcher sitting on thousands of open-ended survey responses, interview transcripts, or product reviews, the same pipeline applies. Define your codebook as a set of verbose category descriptions, run <code class="language-plaintext highlighter-rouge">classify()</code>, and get back a coded dataset in minutes rather than weeks. The package supports multi-model ensembles, chain-of-thought reasoning, and automatic inter-rater reliability metrics: all the things you’d want for academic coding workflows.</p>

<p>If you want to adapt it for your own platform or use case, <a href="https://github.com/chrissoria/cat-llm">cat-llm</a> is open source and built to be forked. cat-vader is one fork; there’s no reason there couldn’t be a cat-reddit, a cat-bluesky, or a cat-transcripts for interview data. The core classification and exploration logic is platform-agnostic. All you need to wire up is a data ingestion layer for whatever source you’re working with.</p>

<p>If you build something interesting with it, I’d genuinely like to hear about it. Reach out at <a href="mailto:chrissoria@berkeley.edu">chrissoria@berkeley.edu</a>.</p>

<p>One last thing: I’ll be re-running this analysis in six months to see whether I’ve made good on my vow to reduce my shit posting. The pipeline takes an afternoon. The habit change may take longer.</p>

<hr />

<h2 id="how-i-did-it-and-how-you-can-too">How I Did It, and How You Can Too?</h2>

<p>Want to run this on your own data? Here’s the technical setup.</p>

<h3 id="getting-started">Getting Started</h3>

<p><a href="https://pypi.org/project/cat-vader/">cat-vader</a> is available on PyPI:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cat-vader
</code></pre></div></div>

<p>You’ll also need a Threads access token. Generate one via the <a href="https://developers.facebook.com/">Meta for Developers</a> portal (create an app, add the Threads product, and generate a long-lived user token), then add it to a <code class="language-plaintext highlighter-rouge">.env</code> file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">THREADS_ACCESS_TOKEN</span><span class="o">=</span><span class="s2">"your-token-here"</span>
<span class="nv">THREADS_USER_ID</span><span class="o">=</span><span class="s2">"your-numeric-user-id"</span>
</code></pre></div></div>

<p>cat-vader will pick these up automatically when you call any function with <code class="language-plaintext highlighter-rouge">sm_source="threads"</code>. Alternatively, you can pass your API key directly as a parameter in any function call — the <code class="language-plaintext highlighter-rouge">.env</code> file is just a convenience for avoiding repetition.</p>

<p>If you already have social media data — a CSV of posts from any platform, a scraped dataset, a platform export — you can skip the API setup entirely and pass your text directly to <code class="language-plaintext highlighter-rouge">classify()</code>, <code class="language-plaintext highlighter-rouge">explore()</code>, or <code class="language-plaintext highlighter-rouge">extract()</code> via the <code class="language-plaintext highlighter-rouge">input_data</code> parameter. The <code class="language-plaintext highlighter-rouge">sm_source</code> integration is a convenience layer on top of the same classification engine.</p>

<h3 id="pulling-my-threads-history">Pulling My Threads History</h3>

<p>For the Threads API pull: cat-vader connects to your account and retrieves your full post history — every post you’ve made, along with engagement metrics — automatically, without any manual data export. You authenticate once via the Threads Graph API, store your credentials in a <code class="language-plaintext highlighter-rouge">.env</code> file, and cat-vader handles the rest.</p>

<p>Under the hood, the package paginates through your full post history (the API returns up to 100 posts per page), fetches engagement metrics for each post in a separate insights call, and returns everything as a single tidy DataFrame with one row per post. Columns include the post text, image URL (when an image was attached), media type, and metrics: likes, views, replies, reposts, quotes, and shares.</p>

<p>The key parameter is <code class="language-plaintext highlighter-rouge">sm_source="threads"</code>, which can be passed to any of the main functions — <code class="language-plaintext highlighter-rouge">classify()</code>, <code class="language-plaintext highlighter-rouge">extract()</code>, or <code class="language-plaintext highlighter-rouge">explore()</code>. You can scope the pull to a specific time window using <code class="language-plaintext highlighter-rouge">sm_months</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catvader</span> <span class="k">as</span> <span class="n">cv</span>

<span class="c1"># Pull and classify the last 12 months of your personal Threads history
</span><span class="n">results</span> <span class="o">=</span> <span class="n">cv</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">sm_source</span><span class="o">=</span><span class="s">"threads"</span><span class="p">,</span>   <span class="c1"># connect to your Threads account
</span>    <span class="n">sm_months</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span>          <span class="c1"># fetch all posts from the past year
</span>    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Politics"</span><span class="p">,</span> <span class="s">"Technology &amp; AI"</span><span class="p">,</span> <span class="s">"Economics"</span><span class="p">,</span> <span class="s">"Health &amp; Science"</span><span class="p">,</span>
                <span class="s">"Education &amp; Research"</span><span class="p">,</span> <span class="s">"Culture &amp; Entertainment"</span><span class="p">,</span>
                <span class="s">"Social Issues"</span><span class="p">,</span> <span class="s">"Personal"</span><span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-openai-api-key"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here are my five most-liked posts from the dataset:</p>

<table>
  <thead>
    <tr>
      <th>date</th>
      <th>text</th>
      <th>likes</th>
      <th>views</th>
      <th>replies</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2025-11-14</td>
      <td>Credit to Representative Robert Garcia for releasing the initial three emails from the Epstein emails. Dude is risking a lot.</td>
      <td>5,146</td>
      <td>22,504</td>
      <td>61</td>
    </tr>
    <tr>
      <td>2025-02-17</td>
      <td>Testing my hypothesis that the algorithm will boost the word Costco. If this gets more than my usual 0 likes then the null is rejected. Costco Costco Costco…</td>
      <td>2,637</td>
      <td>35,387</td>
      <td>36</td>
    </tr>
    <tr>
      <td>2025-01-12</td>
      <td>Alex Jones was posting on X that L.A. firefighters were battling the blazes using ladies’ handbags as buckets because officials had donated equipment to Ukraine…</td>
      <td>1,123</td>
      <td>8,235</td>
      <td>127</td>
    </tr>
    <tr>
      <td>2025-09-24</td>
      <td>Did you all notice how your collective action just “overpowered” Trump and got Jimmy Kimmel back on the air?</td>
      <td>619</td>
      <td>8,517</td>
      <td>67</td>
    </tr>
    <tr>
      <td>2025-01-24</td>
      <td>The United State of California has a good ring to it.</td>
      <td>491</td>
      <td>7,305</td>
      <td>50</td>
    </tr>
  </tbody>
</table>

<p>For my account, pulling my full history returned <strong>850 posts</strong> going back to July 2023, about two and a half years. Of those, 176 were image posts, 5 were videos, and 582 had text content; the remainder were reposts or media-only posts.</p>

<p>One note on the metrics: the Threads Insights API takes a few hours to populate data for brand new posts, so very recent posts may show zeros. Older posts return accurate lifetime totals.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="LLM" /><category term="social media" /><category term="cat-llm" /><category term="cat-vader" /><category term="NLP" /><category term="open source" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Ensemble Classification in CatLLM: Combining Multiple Models for Robust Results</title><link href="https://christophersoria.com/posts/2026/01/catllm-ensemble-classification/" rel="alternate" type="text/html" title="Ensemble Classification in CatLLM: Combining Multiple Models for Robust Results" /><published>2026-01-17T00:00:00-08:00</published><updated>2026-01-17T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/01/catllm-ensemble-classification</id><content type="html" xml:base="https://christophersoria.com/posts/2026/01/catllm-ensemble-classification/"><![CDATA[<p><img src="/images/catllm_ensemble.png" alt="CatLLM" /></p>

<p><a href="https://pypi.org/project/cat-llm/">CatLLM</a> now supports ensemble classification—running multiple models in parallel and combining their predictions through voting. This addresses a persistent concern in LLM-based classification: how do you know if a single model’s output is reliable?</p>

<h2 id="the-problem-with-single-model-classification">The Problem with Single-Model Classification</h2>

<p>When you classify survey responses with a single LLM, you’re trusting that model’s interpretation entirely. But different models have different training data, different biases, and different failure modes. A response that GPT-4o categorizes as “positive sentiment” might be labeled “neutral” by Claude, and “mixed” by Gemini. Which one is right?</p>

<p>For research applications where classification decisions feed into statistical analyses, this uncertainty matters. Ensemble methods offer a way to quantify and reduce it.</p>

<h2 id="three-approaches-to-ensemble-classification">Three Approaches to Ensemble Classification</h2>

<h3 id="1-cross-provider-ensembles">1. Cross-Provider Ensembles</h3>

<p>You can combine models from different providers—OpenAI, Anthropic, Google, Mistral, and others—to get diverse perspectives on each classification:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">input_data</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'responses'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Positive"</span><span class="p">,</span> <span class="s">"Negative"</span><span class="p">,</span> <span class="s">"Neutral"</span><span class="p">],</span>
    <span class="n">models</span><span class="o">=</span><span class="p">[</span>
        <span class="p">(</span><span class="s">"gpt-4o"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">,</span> <span class="s">"sk-..."</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"claude-sonnet-4-5-20250929"</span><span class="p">,</span> <span class="s">"anthropic"</span><span class="p">,</span> <span class="s">"sk-ant-..."</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"gemini-2.5-flash"</span><span class="p">,</span> <span class="s">"google"</span><span class="p">,</span> <span class="s">"AIza..."</span><span class="p">),</span>
    <span class="p">],</span>
    <span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"majority"</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Why this helps:</strong> Each provider’s models are trained on different data with different objectives. When three independently-developed models agree on a classification, that agreement carries more weight than any single model’s confidence score. When they disagree, you’ve identified responses that may require human review.</p>

<h3 id="2-self-consistency-with-temperature">2. Self-Consistency with Temperature</h3>

<p>You can also ensemble the same model against itself by running it multiple times with higher temperature (randomness). This samples from the model’s probability distribution rather than always taking the most likely output:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">input_data</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'responses'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Category A"</span><span class="p">,</span> <span class="s">"Category B"</span><span class="p">,</span> <span class="s">"Category C"</span><span class="p">],</span>
    <span class="n">models</span><span class="o">=</span><span class="p">[</span>
        <span class="p">(</span><span class="s">"gpt-4o"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">,</span> <span class="s">"sk-..."</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"gpt-4o"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">,</span> <span class="s">"sk-..."</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"gpt-4o"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">,</span> <span class="s">"sk-..."</span><span class="p">),</span>
    <span class="p">],</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>  <span class="c1"># Higher temperature for varied outputs
</span>    <span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"majority"</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Why this helps:</strong> At temperature 0, a model always produces the same output for the same input. At higher temperatures, it samples from its full distribution of possible responses. If a classification is robust, the model should arrive at the same answer even when sampling differently. If it produces different answers across runs, that response is likely ambiguous or borderline.</p>

<p>This approach is cheaper than cross-provider ensembles (one API key, often lower per-token costs) while still providing a measure of classification stability.</p>

<h3 id="3-consensus-thresholds">3. Consensus Thresholds</h3>

<p>CatLLM provides three voting rules for determining consensus:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Majority: At least 50% of models must agree
</span><span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"majority"</span>

<span class="c1"># Two-thirds: At least 67% of models must agree
</span><span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"two-thirds"</span>

<span class="c1"># Unanimous: All models must agree
</span><span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"unanimous"</span>
</code></pre></div></div>

<p>You can also specify custom numeric thresholds (e.g., <code class="language-plaintext highlighter-rouge">consensus_threshold=0.75</code> for 75% agreement).</p>

<p><strong>Majority</strong> is the least restrictive. With three models, two agreeing is sufficient. This maximizes the number of responses that receive a consensus classification.</p>

<p><strong>Two-thirds</strong> requires stronger agreement. With three models, you still need two to agree (67%), but with six models, you’d need four. This reduces false positives at the cost of more responses falling below threshold.</p>

<p><strong>Unanimous</strong> is the most restrictive. Every model must agree for a category to be marked present. This produces high-confidence classifications but may leave many responses without consensus, flagging them for human review.</p>
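
<p>To make the three rules concrete, here’s a small standalone sketch (plain Python, not CatLLM internals) showing how the same set of votes resolves under each threshold; the <code class="language-plaintext highlighter-rouge">consensus</code> helper and the example votes are purely illustrative.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Standalone illustration of the voting rules (not CatLLM internals).
from collections import Counter

def consensus(votes, threshold):
    """Return the winning label if its vote share meets the threshold, else None."""
    label, count = Counter(votes).most_common(1)[0]
    share = count / len(votes)
    cutoff = {"majority": 0.5, "two-thirds": 2 / 3, "unanimous": 1.0}.get(threshold, threshold)
    return label if share >= cutoff else None

votes = [1, 1, 1, 0, 0]  # five models: three mark the category present, two absent
print(consensus(votes, "majority"))    # 1    (3/5 = 60%, meets the 50% cutoff)
print(consensus(votes, "two-thirds"))  # None (60% falls short of 67%), flag for review
print(consensus(votes, 0.75))          # None (a custom 75% threshold is also not met)
</code></pre></div></div>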

<h2 id="interpreting-the-output">Interpreting the Output</h2>

<p>The results DataFrame includes columns for each model’s individual classification plus the consensus:</p>

<table>
  <thead>
    <tr>
      <th>response</th>
      <th>category_1_gpt_4o</th>
      <th>category_1_claude</th>
      <th>category_1_gemini</th>
      <th>category_1_consensus</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“Great service”</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
    </tr>
    <tr>
      <td>“It was okay”</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>“Loved it”</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<p>For the second response, GPT-4o and Gemini said “not positive” while Claude said “positive.” With majority voting, the consensus is 0 (not positive): two of the three models agree that the category is absent, and the lone “positive” vote falls short of a majority.</p>

<p>You can use the agreement patterns to:</p>
<ul>
  <li>Identify systematic differences between models</li>
  <li>Flag ambiguous responses for manual review (see the sketch after this list)</li>
  <li>Report inter-model reliability alongside your results</li>
</ul>
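
<p>Here’s a minimal post-processing sketch along those lines. It assumes <code class="language-plaintext highlighter-rouge">results</code> is the DataFrame returned by <code class="language-plaintext highlighter-rouge">cat.classify()</code> and that the per-model column names match the example table above; adjust the names to whatever your actual output contains.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: agreement rates and review flags from the ensemble output above.
# Column names follow the example table; adjust them to match your output.
from itertools import combinations

model_cols = ["category_1_gpt_4o", "category_1_claude", "category_1_gemini"]

# Per-response agreement: the share of models matching the most common vote
results["agreement"] = results[model_cols].apply(
    lambda row: row.value_counts(normalize=True).max(), axis=1
)

# Flag every response where the models did not fully agree for manual review
needs_review = results[results["agreement"] != 1.0]

# Pairwise agreement rates between models (a simple inter-model reliability check)
for a, b in combinations(model_cols, 2):
    rate = (results[a] == results[b]).mean()
    print(f"{a} vs {b}: {rate:.0%} agreement")
</code></pre></div></div>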

<h2 id="practical-considerations">Practical Considerations</h2>

<p><strong>Cost:</strong> Ensemble classification multiplies your API costs by the number of models. Three models means roughly 3x the cost. For large datasets, consider running ensembles on a sample first to calibrate, then using a single model for the full dataset.</p>
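
<p>As a rough sketch of that calibration step: run the ensemble on a random sample, then check how often your preferred single model matches the consensus. The call pattern reuses the examples above; the Anthropic and Google model identifiers and the API keys below are placeholders.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Calibration sketch: ensemble on a sample, then decide whether one model is enough.
# Model identifiers and API keys are placeholders.
sample = df.sample(n=200, random_state=42)

sample_results = cat.classify(
    input_data=sample['responses'],
    categories=["Category A", "Category B", "Category C"],
    models=[
        ("gpt-4o", "openai", "sk-..."),
        ("claude-sonnet", "anthropic", "sk-ant-..."),
        ("gemini-flash", "google", "AIza..."),
    ],
    consensus_threshold="majority"
)

# Share of sampled responses where the single cheapest model matches the consensus;
# if this is high enough for your purposes, run that model alone on the full data.
match_rate = (sample_results['category_1_gpt_4o'] == sample_results['category_1_consensus']).mean()
print(f"GPT-4o matches the consensus on {match_rate:.0%} of the sample")
</code></pre></div></div>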

<p><strong>Speed:</strong> Models are called in parallel, so wall-clock time doesn’t increase linearly. Three models typically take only slightly longer than one.</p>

<p><strong>When to use ensembles:</strong> Ensembles are most valuable when classification decisions are consequential—when they feed into regression models, when you’re publishing findings, or when categories are subjective enough that reasonable people might disagree.</p>

<p><strong>When a single model suffices:</strong> For exploratory analysis, prototyping, or cases where categories are unambiguous, a single model is faster and cheaper.</p>

<h2 id="try-it">Try It</h2>

<p>Ensemble classification is available in CatLLM 0.1.16+ and in the <a href="https://huggingface.co/spaces/CatLLM/survey-classifier">web app</a>. In the web app, select “Model Comparison” to see each model’s output side-by-side, or “Ensemble” to get majority-vote consensus classifications.</p>

<h3 id="links">Links</h3>

<ul>
  <li><strong>Web App:</strong> <a href="https://huggingface.co/spaces/CatLLM/survey-classifier">https://huggingface.co/spaces/CatLLM/survey-classifier</a></li>
  <li><strong>Python Package:</strong> <a href="https://pypi.org/project/cat-llm/">https://pypi.org/project/cat-llm/</a></li>
  <li><strong>GitHub:</strong> <a href="https://github.com/chrissoria/cat-llm">https://github.com/chrissoria/cat-llm</a></li>
</ul>

<p>If you have questions or want to discuss ensemble methods for your research, reach out at <a href="mailto:ChrisSoria@Berkeley.edu">ChrisSoria@Berkeley.edu</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="CatLLM" /><category term="Ensemble Methods" /><category term="Large Language Models" /><category term="Survey Data" /><category term="Classification" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">CatLLM is Now a Web App</title><link href="https://christophersoria.com/posts/2026/01/catllm-web-app/" rel="alternate" type="text/html" title="CatLLM is Now a Web App" /><published>2026-01-06T00:00:00-08:00</published><updated>2026-01-06T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/01/catllm-web-app</id><content type="html" xml:base="https://christophersoria.com/posts/2026/01/catllm-web-app/"><![CDATA[<p><img src="/images/catllm_bw_banner.png" alt="CatLLM" /></p>

<p>I’ve been working on <a href="https://pypi.org/project/cat-llm/">CatLLM</a>, a Python package for classifying open-ended survey responses with LLMs. This week I converted it into a web app.</p>

<p><strong>Try it here:</strong> <a href="https://huggingface.co/spaces/chrissoria/CatLLM">https://huggingface.co/spaces/chrissoria/CatLLM</a></p>

<h2 id="the-problem">The Problem</h2>

<p>If you’ve worked with open-ended survey data, you know the workflow: hundreds or thousands of free-text responses that need to be categorized before you can do any quantitative analysis. The traditional approach is manual coding—either doing it yourself or hiring RAs. It’s slow, expensive, and doesn’t scale.</p>

<h2 id="what-the-app-does">What the App Does</h2>

<p>The web app lets you classify survey responses without writing any code:</p>

<ol>
  <li><strong>Upload your data</strong> — CSV, Excel, or PDF documents</li>
  <li><strong>Define categories</strong> — Either specify your own categories or let the model extract them from your data automatically</li>
  <li><strong>Run classification</strong> — The model assigns each response to one or more categories (multi-label classification)</li>
  <li><strong>Download results</strong> — Get a CSV with classifications plus a methodology write-up you can adapt for your paper</li>
</ol>

<p>The same functionality is available in the Python package if you prefer working in code, but the web app removes the setup barrier for researchers who just want to try it out.</p>
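
<p>For reference, a rough package equivalent of the web app workflow looks like the sketch below. The <code class="language-plaintext highlighter-rouge">multi_class</code> call mirrors the examples elsewhere on this site; the file path, model name, provider string, and categories are placeholders, so check the documentation for the exact options.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rough package equivalent of the web app steps (placeholders throughout).
import catllm as cat
import pandas as pd

df = pd.read_csv("responses.csv")  # placeholder path to your free-text responses

results = cat.multi_class(
    survey_input=df['responses'],
    categories=["Category 1", "Category 2", "Category 3"],
    api_key="your-api-key",
    user_model="gpt-4o-mini",   # placeholder model; provider string below is an assumption
    model_source="openai",
    creativity=0
)

# Assuming a tabular return as in the package's other examples,
# saving to CSV plays the role of the app's download step.
results.to_csv("classified_responses.csv", index=False)
</code></pre></div></div>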

<h2 id="free-models-for-now">Free Models (For Now)</h2>

<p>I’m covering the API costs temporarily so people can test it. The free tier currently includes:</p>

<ul>
  <li>GPT-4o-mini (OpenAI)</li>
  <li>Claude 3 Haiku (Anthropic)</li>
  <li>Gemini 2.5 Flash (Google)</li>
  <li>Llama 3.3 70B (via Groq)</li>
  <li>DeepSeek V3.1</li>
  <li>Qwen3 235B</li>
  <li>Mistral Medium</li>
  <li>Grok 4 Fast</li>
</ul>

<p>If you need more powerful models or have large-scale needs, you can bring your own API key.</p>

<h2 id="looking-for-feedback">Looking for Feedback</h2>

<p>This is still early. I’d appreciate it if you:</p>

<ul>
  <li><strong>Break it</strong> — Find edge cases, report bugs, tell me what fails</li>
  <li><strong>Suggest features</strong> — What would make this useful for your research?</li>
  <li><strong>Collaborate</strong> — If you’re interested in working on a methods paper evaluating LLM classification for survey data, I’m open to it</li>
</ul>

<p>You can reach me at <a href="mailto:ChrisSoria@Berkeley.edu">ChrisSoria@Berkeley.edu</a> or leave comments on the <a href="https://github.com/chrissoria/cat-llm">GitHub repo</a>.</p>

<h2 id="links">Links</h2>

<ul>
  <li><strong>Web App:</strong> <a href="https://huggingface.co/spaces/chrissoria/CatLLM">https://huggingface.co/spaces/chrissoria/CatLLM</a></li>
  <li><strong>Python Package:</strong> <a href="https://pypi.org/project/cat-llm/">https://pypi.org/project/cat-llm/</a></li>
  <li><strong>GitHub:</strong> <a href="https://github.com/chrissoria/cat-llm">https://github.com/chrissoria/cat-llm</a></li>
</ul>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="CatLLM" /><category term="Web App" /><category term="Large Language Models" /><category term="Survey Data" /><category term="Open Source" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">CatLLM Now Supports Huggingface: Access Thousands of Open-Source Models</title><link href="https://christophersoria.com/posts/2025/12/cat-llm-huggingface/" rel="alternate" type="text/html" title="CatLLM Now Supports Huggingface: Access Thousands of Open-Source Models" /><published>2025-12-30T00:00:00-08:00</published><updated>2025-12-30T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2025/12/cat-llm-huggingface</id><content type="html" xml:base="https://christophersoria.com/posts/2025/12/cat-llm-huggingface/"><![CDATA[<p><img src="/images/catllm_research.png" alt="CatLLM" /></p>

<p><a href="https://pypi.org/project/cat-llm/"><strong>CatLLM</strong></a> now works with Huggingface. This means you can use open-weight and open-source models without needing the compute power to run them locally. Just get your Huggingface API key (I recommend the Pro subscription for heavy usage—it’s only $9/month) and you’ll have access to models like Qwen, DeepSeek, and Llama. This is useful for researchers who want to test open-weight models or take advantage of their lower cost for classification tasks.</p>

<p>Another benefit is access to thousands of user-trained models for specific tasks. For example:</p>

<ul>
  <li>
    <p><strong>MedAlpaca-7B (Medical Domain)</strong> - <code class="language-plaintext highlighter-rouge">medalpaca/medalpaca-7b</code> - A 7-billion parameter LLM specifically fine-tuned for medical domain tasks, built on top of the LLaMA architecture. It’s designed to improve question-answering and medical dialogue capabilities.</p>
  </li>
  <li>
    <p><strong>CodeLlama-7B (Code Generation &amp; Understanding)</strong> - <code class="language-plaintext highlighter-rouge">codellama/CodeLlama-7b-hf</code> - Meta’s specialized code-focused LLM, available in 7B, 13B, 34B, and 70B parameter versions. It comes in three variants: base (general code), Python-specific, and Instruct (for code assistance).</p>
  </li>
  <li>
    <p><strong>Aya-23-8B (Multilingual)</strong> - <code class="language-plaintext highlighter-rouge">CohereLabs/aya-23-8B</code> - Developed by Cohere Labs, Aya 23 is an instruction-tuned model supporting 23 languages including Arabic, Chinese, French, German, Hindi, Japanese, Korean, Spanish, and more.</p>
  </li>
</ul>

<p>You can also train and host your own models. Hugging Face provides several tools for fine-tuning:</p>

<ul>
  <li><strong>Transformers Library</strong> - Use the Trainer API for fine-tuning any model from the Hub</li>
  <li><strong>PEFT (Parameter-Efficient Fine-Tuning)</strong> - Techniques like LoRA and QLoRA for efficient fine-tuning with less compute (see the sketch after this list)</li>
  <li><strong>AutoTrain</strong> - A no-code/low-code solution for fine-tuning models directly on Hugging Face</li>
  <li><strong>TRL (Transformer Reinforcement Learning)</strong> - For RLHF and preference tuning</li>
</ul>
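
<p>As a rough illustration of the PEFT route, here’s a minimal LoRA sketch. The base model, target modules, and hyperparameters are placeholders for your own task, and you would still supply a training loop (for example <code class="language-plaintext highlighter-rouge">transformers.Trainer</code> or <code class="language-plaintext highlighter-rouge">trl.SFTTrainer</code>).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal LoRA sketch with transformers + peft (placeholders; adapt to your task).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM on the Hub works

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA trains small low-rank adapter matrices instead of updating all weights
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; adjust per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# ...train with transformers.Trainer or trl.SFTTrainer on your labeled examples...
</code></pre></div></div>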

<p>You can upload and host your models in several ways:</p>

<ul>
  <li><strong>Web Interface</strong> - Go to huggingface.co/new, then use “Add File” → “Upload File”</li>
  <li><strong>Python Libraries</strong> - Use <code class="language-plaintext highlighter-rouge">model.push_to_hub("your-username/model-name")</code> with Transformers or the huggingface_hub library</li>
  <li><strong>Git</strong> - Since repos are Git-based, you can push directly via command line</li>
</ul>

<p>Your model doesn’t need to be compatible with Transformers—any custom model works.</p>
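
<p>A short sketch of those upload options, assuming you already have a fine-tuned <code class="language-plaintext highlighter-rouge">model</code> and <code class="language-plaintext highlighter-rouge">tokenizer</code> in memory (for example from the LoRA sketch above); repo names and file paths are placeholders.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Uploading to the Hub: push_to_hub for transformers/peft models,
# or the huggingface_hub API for arbitrary files. Names are placeholders.
from huggingface_hub import HfApi, login

login(token="hf_...")  # or set the HF_TOKEN environment variable

# Route 1: a transformers/peft model and tokenizer already in memory
model.push_to_hub("your-username/your-model-name")
tokenizer.push_to_hub("your-username/your-model-name")

# Route 2: any custom file, regardless of framework
api = HfApi()
api.create_repo("your-username/your-model-name", exist_ok=True)
api.upload_file(
    path_or_fileobj="model.safetensors",
    path_in_repo="model.safetensors",
    repo_id="your-username/your-model-name",
)
</code></pre></div></div>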

<p>Here are a few examples of what you could do:</p>

<ol>
  <li>
    <p><strong>Region-specific language models</strong> - Fine-tune a model specifically for extracting information from Spanish-speaking respondents from a particular country, rather than Spanish generally. For example, a model trained on Dominican or Puerto Rican Spanish would better understand the distinct vocabulary, slang, and expressions that differ significantly from Mexican Spanish.</p>
  </li>
  <li>
    <p><strong>Specialized scoring models</strong> - Train a model specifically for detecting the quality of drawn shapes for cognitive impairment assessment. Instead of relying on general-purpose vision models, you could create one optimized for CERAD-style scoring tasks.</p>
  </li>
  <li>
    <p><strong>Domain-specific extraction models</strong> - Build a model designed to extract key details from long texts in your field—such as one trained to pull specific policy details from local city council meeting transcripts, or one that identifies funding amounts and grant recipients from foundation reports.</p>
  </li>
</ol>

<h3 id="getting-started">Getting Started</h3>

<p>Using Huggingface with CatLLM is straightforward. Simply specify <code class="language-plaintext highlighter-rouge">model_source="huggingface"</code> and provide your Huggingface API key:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">multi_class</span><span class="p">(</span>
    <span class="n">survey_input</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'responses'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Category 1"</span><span class="p">,</span> <span class="s">"Category 2"</span><span class="p">,</span> <span class="s">"Category 3"</span><span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-huggingface-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"Qwen/Qwen3-VL-235B-A22B-Instruct:novita"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">chain_of_thought</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here’s an example using CodeLlama to analyze code snippets for specific features:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="c1"># Analyze code snippets for security and quality features
</span><span class="n">code_analysis</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">multi_class</span><span class="p">(</span>
    <span class="n">survey_input</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'code_snippets'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Contains SQL queries"</span><span class="p">,</span>
        <span class="s">"Has proper error handling"</span><span class="p">,</span>
        <span class="s">"Uses deprecated functions"</span><span class="p">,</span>
        <span class="s">"Contains hardcoded credentials"</span><span class="p">,</span>
        <span class="s">"Implements input validation"</span>
    <span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-huggingface-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"codellama/CodeLlama-7b-Instruct-hf"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">chain_of_thought</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here’s an example using Aya-23 to classify Spanish survey responses with categories written in Spanish:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="c1"># Classify Spanish survey responses about healthcare access
</span><span class="n">healthcare_analysis</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">multi_class</span><span class="p">(</span>
    <span class="n">survey_input</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'respuestas_encuesta'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Este participante contestó que tiene acceso a seguro médico"</span><span class="p">,</span>
        <span class="s">"Este participante mencionó barreras financieras"</span><span class="p">,</span>
        <span class="s">"Este participante expresó dificultades con el idioma"</span><span class="p">,</span>
        <span class="s">"Este participante indicó satisfacción con su atención médica"</span><span class="p">,</span>
        <span class="s">"Este participante reportó largos tiempos de espera"</span>
    <span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-huggingface-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"CohereLabs/aya-23-8B"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">chain_of_thought</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here’s an example using MedAlpaca to classify medical interview notes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="c1"># Classify patient interview notes for symptoms and conditions
</span><span class="n">medical_analysis</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">multi_class</span><span class="p">(</span>
    <span class="n">survey_input</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'patient_notes'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Patient reports cardiovascular symptoms"</span><span class="p">,</span>
        <span class="s">"Patient mentions respiratory issues"</span><span class="p">,</span>
        <span class="s">"Patient describes chronic pain"</span><span class="p">,</span>
        <span class="s">"Patient indicates mental health concerns"</span><span class="p">,</span>
        <span class="s">"Patient reports medication side effects"</span>
    <span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-huggingface-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"medalpaca/medalpaca-7b"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">chain_of_thought</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Important Limitation:</strong> MedAlpaca targets medical student-level knowledge and should never be used as a substitute for professional medical advice.</p>

<h3 id="full-provider-support">Full Provider Support</h3>

<p>CatLLM now supports seven major providers:</p>

<ul>
  <li>OpenAI (GPT-4o, GPT-5)</li>
  <li>Anthropic (Claude Sonnet 4, Claude 3.5)</li>
  <li>Google (Gemini 2.5 Flash/Pro)</li>
  <li><strong>Huggingface</strong> (Qwen, Llama, DeepSeek, community models)</li>
  <li>xAI (Grok)</li>
  <li>Mistral (Mistral Large, Pixtral)</li>
  <li>Perplexity (Sonar models)</li>
</ul>

<h3 id="get-in-touch">Get in Touch</h3>

<p>If you have any questions about using Huggingface with CatLLM, or if you’d like guidance on how to fine-tune a model for your specific research needs to maximize consistency and quality of output, feel free to reach out. I’m happy to help researchers get the most out of these tools. You can contact me at <a href="mailto:ChrisSoria@Berkeley.edu">ChrisSoria@Berkeley.edu</a>.</p>

<h3 id="learn-more">Learn More</h3>

<ul>
  <li><a href="https://github.com/chrissoria/cat-llm#readme">View the documentation</a></li>
  <li><a href="https://pypi.org/project/cat-llm/">Install from PyPI</a>: <code class="language-plaintext highlighter-rouge">pip install cat-llm</code></li>
</ul>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="Huggingface" /><category term="Open-Source Models" /><category term="Large Language Models" /><category term="Python Package" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Selected as a 2025 Bashir Ahmed Graduate Fellow</title><link href="https://christophersoria.com/posts/2025/12/ahmed-fellowship/" rel="alternate" type="text/html" title="Selected as a 2025 Bashir Ahmed Graduate Fellow" /><published>2025-12-11T00:00:00-08:00</published><updated>2025-12-11T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2025/12/ahmed-fellowship</id><content type="html" xml:base="https://christophersoria.com/posts/2025/12/ahmed-fellowship/"><![CDATA[<p><img src="/images/ahmed.jpg" alt="Ahmed Fellowship" /></p>

<p>I am honored to have been selected as a 2025 awardee of the Bashir Ahmed Graduate Fellowship at UC Berkeley’s Department of Demography.</p>

<p>The Bashir Ahmed Graduate Fellowship supports dissertation research for students in the Demography PhD Program and the Sociology &amp; Demography PhD Program. This fellowship provides critical support as I continue my dissertation research at the intersection of social demography, epidemiology, and computational methods.</p>

<p>My research examines how social networks correlate with mortality outcomes at the county level, investigates how partisan social networks can worsen disease outcomes, and studies the impact of loneliness on aging and health. Beyond traditional demographic research, I also apply large language models to demographic studies, developing techniques for text classification and computational analysis of survey data.</p>

<p>What makes this recognition particularly meaningful to me is that this year the fellowship was not application-based—the faculty independently selected the recipient. I am deeply grateful to the Department of Demography and the Ahmed family for this recognition and support. This fellowship will enable me to further develop my research on population health dynamics and continue bridging computational methods with demographic inquiry.</p>

<p>You can learn more about the fellowship and past recipients on the <a href="https://www.demog.berkeley.edu/about/ahmed-fellowship/">UC Berkeley Demography website</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="Fellowship" /><category term="UC Berkeley" /><category term="Demography" /><category term="Research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">New Python Package: llm-web-research for Verified Web Research</title><link href="https://christophersoria.com/posts/2025/12/llm-web-research/" rel="alternate" type="text/html" title="New Python Package: llm-web-research for Verified Web Research" /><published>2025-12-11T00:00:00-08:00</published><updated>2025-12-11T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2025/12/llm-web-research</id><content type="html" xml:base="https://christophersoria.com/posts/2025/12/llm-web-research/"><![CDATA[<p><img src="/images/logo_llm_researcher.png" alt="llm-web-research" /></p>

<p>I’m excited to announce the release of a new Python package: <a href="https://github.com/chrissoria/llm-web-research"><strong>llm-web-research</strong></a>. Part of the <a href="https://pypi.org/project/cat-llm/">CatLLM</a> ecosystem, this tool enables LLM-powered web research with a focus on accuracy over quantity—designed for researchers who need verified, high-quality data.</p>

<h2 id="the-problem">The Problem</h2>

<p>When using LLMs for web research, a common issue is ambiguity. Searching for information about “John Smith” or “Springfield” can return incorrect results simply because many different people and places share those names. For research applications where false positives are costly, we need a more rigorous approach.</p>

<h2 id="the-solution-multi-step-verification">The Solution: Multi-Step Verification</h2>

<p>The core innovation of llm-web-research is a 4-step verification pipeline that catches ambiguous queries before returning potentially incorrect answers:</p>

<ol>
  <li><strong>Information Gathering</strong> — Initial web search to understand the entity and context</li>
  <li><strong>Ambiguity Detection</strong> — Explicit checks for name conflicts, common names, and contradictions</li>
  <li><strong>Skeptical Verification</strong> — Secondary search actively looking for contradicting information</li>
  <li><strong>Structured Output</strong> — JSON formatting with binary confidence scoring</li>
</ol>

<p>The design philosophy is simple: <strong>no answer is better than a wrong answer.</strong></p>

<h2 id="key-features">Key Features</h2>

<ul>
  <li><strong>Two research modes:</strong> <code class="language-plaintext highlighter-rouge">precise_web_research()</code> for maximum accuracy, <code class="language-plaintext highlighter-rouge">web_research()</code> for faster single-step searches</li>
  <li><strong>Multi-provider support:</strong> Anthropic, Google Gemini, Perplexity</li>
  <li><strong>Structured output:</strong> Returns pandas DataFrames with answers and source URLs</li>
  <li><strong>Safety features:</strong> Incremental CSV saving for long-running searches, automatic “Information unclear” responses when uncertain</li>
</ul>

<h2 id="installation-and-usage">Installation and Usage</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>llm-web-research
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">llm_web_research</span> <span class="k">as</span> <span class="n">lwr</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">lwr</span><span class="p">.</span><span class="n">precise_web_research</span><span class="p">(</span>
    <span class="n">search_question</span><span class="o">=</span><span class="s">"founding year"</span><span class="p">,</span>
    <span class="n">search_input</span><span class="o">=</span><span class="p">[</span><span class="s">"Apple"</span><span class="p">,</span> <span class="s">"Microsoft"</span><span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-api-key"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"anthropic"</span>
<span class="p">)</span>
</code></pre></div></div>
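
<p>For cases where speed matters more than the full verification pipeline, the faster single-step mode can be called in the same way. The snippet below assumes <code class="language-plaintext highlighter-rouge">web_research()</code> shares the same core parameters as <code class="language-plaintext highlighter-rouge">precise_web_research()</code>, so check the documentation for the exact signature.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Faster single-step mode (parameters assumed to mirror precise_web_research;
# verify against the package documentation before relying on this).
quick_results = lwr.web_research(
    search_question="founding year",
    search_input=["Apple", "Microsoft"],
    api_key="your-api-key",
    model_source="anthropic"
)
</code></pre></div></div>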

<h2 id="use-cases">Use Cases</h2>

<ul>
  <li>Academic research requiring verified sources</li>
  <li>Fact-checking with high accuracy requirements</li>
  <li>Building high-quality datasets</li>
  <li>Automated due diligence tasks</li>
</ul>

<p>Check out the <a href="https://github.com/chrissoria/llm-web-research">GitHub repository</a> for full documentation and examples, or install directly from <a href="https://pypi.org/project/llm-web-research/">PyPI</a>.</p>

<hr />

<h2 id="acknowledgments">Acknowledgments</h2>

<p>This work was supported by the <a href="https://www.demog.berkeley.edu/about/ahmed-fellowship/">Bashir Ahmed Graduate Fellowship</a> at UC Berkeley’s Department of Demography. I am grateful to the Ahmed family and the Department of Demography for their support of my research.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="Large Language Models" /><category term="Python Package" /><category term="Web Research" /><category term="Data Collection" /><summary type="html"><![CDATA[]]></summary></entry></feed>