<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://christophersoria.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://christophersoria.com/" rel="alternate" type="text/html" /><updated>2026-03-23T09:41:11-07:00</updated><id>https://christophersoria.com/feed.xml</id><title type="html">Chris Soria</title><subtitle>PhD Candidate at UC Berkeley, Demography</subtitle><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><entry><title type="html">What California Cities Actually Legislate: Classifying Municipal Ordinances with CatLLM</title><link href="https://christophersoria.com/posts/2026/03/catpol-ordinance-analysis/" rel="alternate" type="text/html" title="What California Cities Actually Legislate: Classifying Municipal Ordinances with CatLLM" /><published>2026-03-22T00:00:00-07:00</published><updated>2026-03-22T00:00:00-07:00</updated><id>https://christophersoria.com/posts/2026/03/catpol-ordinance-analysis</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/catpol-ordinance-analysis/"><![CDATA[<p><img src="/images/catpol-ordinance-banner.png" alt="" /></p>

<p>Local laws shape daily life in ways that most people never see — your rent, your commute, what gets built on the corner lot — but they’re written in dense legal language, buried in city clerk archives, and produced at a volume no individual can keep up with. San Diego alone has passed nearly 90,000 ordinances and resolutions. San Francisco adds dozens per month. Journalists cover the headline votes; researchers study federal legislation; but the vast majority of municipal lawmaking happens without any systematic analysis at all.</p>

<p><strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong> is designed to close that gap. It’s an open-source Python package that pulls municipal ordinances, federal laws, executive orders, and political speech directly from public datasets, then uses LLMs to classify, summarize, and analyze them at scale. It can take a 15,000-word ordinance written in statutory language and tell you, in plain English, what it does, who it affects, and where it falls on the political spectrum.</p>

<p>In this post, I used <a href="https://github.com/chrissoria/cat-llm">cat-llm</a> to classify 200 recent ordinances each from <strong>San Diego</strong> and <strong>San Francisco</strong>, two major California cities with different political characters, against two classification schemes: a 12-category policy taxonomy and a 3-category political lean assessment. The goal: a quantitative snapshot of what these cities legislate about and whether the ideological differences between them show up in the text of their laws.</p>

<p><em>Want to run this on your own data? Skip to the <a href="#how-to-run-it-yourself">methodology and replication section</a>. The entire pipeline is open source and the datasets are public.</em></p>

<hr />

<h2 id="background-cat-pol-and-the-data">Background: cat-pol and the Data</h2>

<p><strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong> is an ecosystem of open-source Python packages that use LLMs to classify text at scale. Users interested specifically in political text analysis can install <strong><a href="https://pypi.org/project/cat-pol/">cat-pol</a></strong> (<code class="language-plaintext highlighter-rouge">pip install cat-pol</code>), which ships with 16 built-in political data sources on HuggingFace, including municipal ordinances from 12 California cities and counties, federal public laws, executive orders, presidential speeches, and Trump’s Truth Social posts, all accessible with a single <code class="language-plaintext highlighter-rouge">source=</code> parameter.</p>

<p>For this analysis, the data comes from two HuggingFace datasets:</p>
<ul>
  <li><strong><a href="https://huggingface.co/datasets/chrissoria/san-diego-ordinances">chrissoria/san-diego-ordinances</a></strong> — 87,983 records (ordinances + resolutions) going back to 1905</li>
  <li><strong><a href="https://huggingface.co/datasets/chrissoria/sf-ordinances">chrissoria/sf-ordinances</a></strong> — 4,048 ordinances going back to 2011</li>
</ul>

<p>Both datasets are scraped from official city clerk systems, include full ordinance text extracted from PDFs, and are updated weekly via automated scrapers. I took the 200 most recent ordinances with text from each city.</p>

<p>What follows is a demonstration of what you can learn from municipal legislation using a few lines of Python and no specialized legal knowledge. The entire analysis can be reproduced for free: the data is publicly hosted on <a href="https://huggingface.co/chrissoria">HuggingFace</a>, cat-llm is <a href="https://github.com/chrissoria/cat-llm">open source</a>, and the classification can run on free-tier HuggingFace models or local models via <a href="https://ollama.com">Ollama</a> with zero API costs. If you can write <code class="language-plaintext highlighter-rouge">pip install cat-llm</code>, you can replicate and extend every finding below.</p>
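
<p>As a concrete starting point, here is a minimal sketch of that data pull using the <code class="language-plaintext highlighter-rouge">fetch_policy_source()</code> call shown later in this post. It needs no API key; note that the <code class="language-plaintext highlighter-rouge">text</code> and <code class="language-plaintext highlighter-rouge">date</code> column names are my assumptions about the dataset schema, so check the returned DataFrame on your own pull.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm

# Pull recent San Diego ordinances straight from the HuggingFace dataset (no API key needed)
sd = catllm.fetch_policy_source("city_san_diego", n=500, doc_type="ordinance")

# Keep the 200 most recent rows that actually contain extracted text
# ("text" and "date" are assumed column names -- inspect sd.columns on your pull)
sd = sd.dropna(subset=["text"])
sd = sd[sd["text"].str.strip() != ""]
sd_recent = sd.sort_values("date", ascending=False).head(200)
</code></pre></div></div>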

<hr />

<h2 id="san-francisco-passes-twice-as-many-laws">San Francisco Passes Twice as Many Laws</h2>

<p>Before asking <em>what</em> these cities legislate about, it’s worth asking <em>how much</em> they legislate.</p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>San Diego</th>
      <th>San Francisco</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Ordinances per month (2020–2025 avg)</td>
      <td>11.2</td>
      <td><strong>22.2</strong></td>
    </tr>
    <tr>
      <td>Ratio</td>
      <td> </td>
      <td><strong>2.0x</strong></td>
    </tr>
  </tbody>
</table>

<p>San Francisco passes roughly <strong>twice as many ordinances per month</strong> as San Diego: 22 vs 11. The pattern is stable across years, not a spike. SF consistently produces 20–25 ordinances per month; SD produces 9–15.</p>

<p>San Diego compensates with a high volume of <em>resolutions</em> (about 49 per month), which are administrative actions (contract approvals, budget items, proclamations) rather than new law. But in terms of actual lawmaking (text that creates, amends, or repeals municipal code), SF is substantially more productive.</p>

<p>This matters for everything that follows. When we compare the <em>share</em> of ordinances in each policy domain, we’re comparing slices of very different pies. A 24% business regulation rate in SF means roughly 5 new business ordinances per month. An 8% rate in SD means less than 1. Throughout the rest of this analysis, I’ll show both percentages and estimated monthly counts to keep the denominators honest.</p>
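
<p>If you replicate this, the conversion is just category share times monthly ordinance volume. A small sketch using the headline numbers from the table above (the function and variable names are illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Convert a category share of ordinances into an estimated monthly count.
# Illustrative inputs: SF averages ~22.2 ordinances/month, SD ~11.2 (2020-2025).
def monthly_count(share, ordinances_per_month):
    return share * ordinances_per_month

sf_business = monthly_count(0.24, 22.2)  # roughly 5.3 business ordinances per month in SF
sd_business = monthly_count(0.08, 11.2)  # roughly 0.9 in SD
print(round(sf_business, 1), round(sd_business, 1))
</code></pre></div></div>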

<hr />

<h2 id="what-do-these-cities-legislate-about">What Do These Cities Legislate About?</h2>

<p>Rather than imposing categories top-down, I used cat-llm’s extract function to discover 12 policy domains directly from the ordinance text. Categories are multi-label, so a single ordinance can be tagged with multiple domains. See the <a href="#methodology-notes">methodology section</a> for how the categories were generated and the full list.</p>

<h3 id="results-policy-domain-distribution">Results: Policy Domain Distribution</h3>

<p><img src="/images/catpol-policy-distribution.png" alt="" /></p>

<p><img src="/images/catpol-policy-gap.png" alt="" /></p>

<p><em>200 most recent ordinances per city. Multi-label (rows sum to &gt;100%). Model: Qwen 2.5-72B-Instruct.</em></p>

<h3 id="what-the-numbers-mean">What the Numbers Mean</h3>

<p>The two cities legislate about fundamentally different things.</p>

<p><strong>San Diego is a building city.</strong> Nearly half (43%) of its recent ordinances touch infrastructure and public works: road improvements, water main replacements, sewer projects, construction contract extensions. Even though SD passes half as many ordinances overall, it still produces <strong>more infrastructure legislation per month in absolute terms</strong> than SF (4.8 vs 3.3). This isn’t just proportional: SD is genuinely more active on physical infrastructure than SF is. Parks come close to the same pattern (1.8 vs 2.3 per month, near parity despite SD’s half-sized legislative output). SD’s legislative agenda reads like a city physically constructing and maintaining itself.</p>

<p><strong>San Francisco is a regulating city.</strong> Its ordinances spread across a wider range of policy domains, with no single category exceeding 27%. The biggest concentrations are in zoning and land use (6.0/month), revenue and financing (5.6/month), and business regulation (5.4/month). Health and social services, a category that barely registers in San Diego at 0.7 ordinances per month, hits <strong>5.3 per month</strong> in SF. That’s not a rounding difference; it’s a 7x gap. Housing policy runs at roughly triple SD’s rate (3.7 vs 1.3/month). San Francisco’s legislative output reads like a city managing social complexity: who can build what, under what conditions, with what protections for whom.</p>

<p>The environmental protection rate is nearly identical between the two cities (~17%), suggesting this is a baseline concern for California municipalities regardless of political character. Tax increases are rare in both (under 3%), which likely reflects the political difficulty of explicit tax votes at the local level.</p>

<blockquote>
  <p><strong>SF legislates more about business in <em>both</em> directions.</strong> San Francisco scores higher than San Diego on business regulation (24% vs 8%) <em>and</em> pro-business/economic development (17% vs 5%). It creates more rules and more incentives. San Diego’s approach to business is to leave it alone. The legislative silence is itself a policy choice.</p>
</blockquote>

<blockquote>
  <p><strong>SF actively raises revenue to fund social programs.</strong> SF generates 25% of its ordinances around revenue and financing, compared to SD’s 16%. Combined with SF’s 24% rate on health and social services (vs SD’s 6%), the picture is a city that raises money to fund an interventionist social agenda. San Diego raises less and spends what it raises on concrete: roads, pipes, parks.</p>
</blockquote>

<p>Put differently: <strong>San Diego legislates like a city that wants to run well. San Francisco legislates like a city that wants to do good.</strong> Whether “doing good” through regulation and social programs actually produces better outcomes is a separate empirical question, but the legislative priorities are unmistakable in the data. SD’s council spends its time keeping the lights on and the water flowing. SF’s council spends its time deciding who gets housing protections, which businesses need new permits, and how to fund homelessness services.</p>

<p>As a robustness check, I re-ran the analysis on 1,000 ordinances per city using GPT-4o with a different, data-driven category scheme. Different model, five times the sample, looser categories. Same conclusion: SD dominates on infrastructure and construction; SF dominates on health, environment, and housing. The pattern holds.</p>
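
<p>The robustness run is the same call with a larger <code class="language-plaintext highlighter-rouge">n</code> and a different model. A sketch of what that looks like; how the <code class="language-plaintext highlighter-rouge">gpt-4o</code> model name maps to a provider is an assumption here, so consult the cat-llm docs if you need to set <code class="language-plaintext highlighter-rouge">model_source</code> explicitly.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm

# Robustness check: larger sample, different model family.
# (Provider handling for "gpt-4o" is an assumption -- see the cat-llm docs.)
robust = catllm.classify_policy(
    source="city_san_diego",
    categories=["Infrastructure and Public Works", "Health and Social Services"],  # trimmed for brevity
    doc_type="ordinance",
    n=1000,
    api_key="your-openai-key",
    user_model="gpt-4o",
)
</code></pre></div></div>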

<hr />

<h2 id="part-2-do-ordinances-have-a-political-lean">Part 2: Do Ordinances Have a Political Lean?</h2>

<h3 id="category-setup">Category Setup</h3>

<p>For the political lean analysis, I used three categories designed to capture ideological orientation:</p>

<ol>
  <li>
    <p><strong>Conservative/Right-Leaning Policy</strong> — deregulation, tax cuts, pro-business measures, law enforcement expansion, property rights protections, reduced government intervention, privatization of services</p>
  </li>
  <li>
    <p><strong>Progressive/Left-Leaning Policy</strong> — new regulations, tax increases, tenant protections, environmental mandates, social services expansion, equity/inclusion initiatives, labor protections, police reform</p>
  </li>
  <li>
    <p><strong>Neutral</strong> — routine contract approvals, procedural amendments, election scheduling, civil service appointments, budget housekeeping with no policy direction</p>
  </li>
</ol>

<p>This classification used <strong>Qwen3-235B</strong>, Qwen’s flagship thinking model, to see whether reasoning capability helps with the more nuanced task of ideological classification.</p>
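
<p>A sketch of that call, for reference. The source identifier for San Francisco and the exact HuggingFace model id for Qwen3-235B are assumptions here; run <code class="language-plaintext highlighter-rouge">catllm.list_policy_sources()</code> and substitute whichever Qwen3-235B variant you have access to.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm

# Political lean classification (source id and model id are assumptions -- see lead-in)
lean = catllm.classify_policy(
    source="city_san_francisco",
    categories=[
        "Conservative/Right-Leaning Policy",
        "Progressive/Left-Leaning Policy",
        "Neutral",
    ],
    doc_type="ordinance",
    n=200,
    api_key="your-hf-token",
    user_model="Qwen/Qwen3-235B-A22B",
    model_source="huggingface",
)
</code></pre></div></div>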

<h3 id="results-political-lean-distribution">Results: Political Lean Distribution</h3>

<p><img src="/images/catpol-political-lean.png" alt="" /></p>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>San Diego</th>
      <th>San Francisco</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Conservative/Right-Leaning</td>
      <td>2.6%</td>
      <td><strong>23.1%</strong></td>
    </tr>
    <tr>
      <td>Progressive/Left-Leaning</td>
      <td>19.9%</td>
      <td><strong>31.8%</strong></td>
    </tr>
    <tr>
      <td>Neutral</td>
      <td><strong>77.6%</strong></td>
      <td>45.6%</td>
    </tr>
  </tbody>
</table>

<p><em>200 most recent ordinances per city. Multi-label classification. Model: Qwen3-235B (thinking model).</em></p>

<p>The results are counterintuitive, and that’s what makes them interesting.</p>

<p><strong>San Diego’s ordinances are overwhelmingly neutral.</strong> Nearly 78% of SD’s recent ordinances carry no detectable ideological lean. These are contract extensions, budget transfers, construction authorizations. The machinery of a city that governs by administration rather than ideology. Only 3% code as conservative and 20% as progressive.</p>

<p><strong>San Francisco legislates politically in both directions.</strong> SF has a higher progressive rate (32% vs 20%), which aligns with expectations. But the surprise is SF’s conservative rate: <strong>23% vs SD’s 3%</strong>. San Francisco’s ordinances code conservative at nearly nine times San Diego’s rate, and since SF passes twice as many ordinances overall, the gap in absolute monthly counts is even larger.</p>

<p>This isn’t because San Francisco has a secret conservative agenda. It’s because SF <em>actively legislates</em> about the domains that register as ideological: business regulation, tax policy, development incentives, policing. When you pass an ordinance streamlining permits for small businesses, that codes as pro-business/conservative. When you pass an ordinance adding tenant protections, that codes as progressive. San Diego doesn’t pass either ordinance. It passes a contract amendment instead.</p>

<p>The neutral rate tells the real story. SD’s 78% neutral rate means its city council spends most of its time on administrative governance. SF’s 46% neutral rate means less than half of its legislative output is purely procedural. The majority of SF ordinances take a policy stance of some kind. <strong>SF doesn’t just lean left — it legislates ideologically, period.</strong> It takes more stances in more directions than San Diego takes in any direction.</p>

<hr />

<h2 id="methodology-notes">Methodology Notes</h2>

<p>A few important details about how this analysis works and where it might break.</p>

<p><strong>Multi-label classification.</strong> Each ordinance can be tagged with multiple categories simultaneously. An infrastructure bond gets both “Infrastructure” and “Revenue and Financing.” This means percentages sum to more than 100%. That’s by design, not a bug.</p>
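
<p>Concretely, the percentages reported above are column means over 0/1 category indicators, which is why they can sum past 100%. A sketch, assuming the classification output carries one binary column per category (column names illustrative):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

def category_shares(results: pd.DataFrame, category_cols: list[str]) -> pd.Series:
    """Percent of rows tagged with each category; multi-label, so the total can exceed 100."""
    return results[category_cols].mean() * 100

# Example with the DataFrame returned by catllm.classify_policy (binary 0/1 columns assumed):
# print(category_shares(results, ["Infrastructure and Public Works", "Revenue and Financing"]))
</code></pre></div></div>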

<p><strong>Model choice.</strong> I used two models: <strong>Qwen 2.5-72B-Instruct</strong> for the 12-category policy domain classification and <strong>Qwen3-235B</strong> (a 235-billion parameter mixture-of-experts thinking model) for the political lean analysis. Both are open-source models accessed via HuggingFace’s inference API, no OpenAI dependency. The robustness check used GPT-4o on a larger sample to confirm the results hold across model families.</p>

<p><strong>Category discovery.</strong> Categories weren’t hand-picked. I used <code class="language-plaintext highlighter-rouge">catllm.extract_policy()</code> to sample 50 ordinances and let the LLM discover recurring themes, then semantically merged duplicates into a clean taxonomy. The <code class="language-plaintext highlighter-rouge">specificity="specific"</code> parameter ensures category names include examples, which significantly improves classification accuracy over bare labels. The 12 policy domain categories used:</p>

<ol>
  <li>Tax Increases</li>
  <li>Revenue and Financing</li>
  <li>Budget and Appropriations</li>
  <li>Housing and Residential Development</li>
  <li>Zoning and Land Use Changes</li>
  <li>Infrastructure and Public Works</li>
  <li>Business Regulation</li>
  <li>Pro-Business and Economic Development</li>
  <li>Environmental Protection</li>
  <li>Public Safety</li>
  <li>Health and Social Services</li>
  <li>Parks, Recreation, and Culture</li>
</ol>

<p>For the political lean analysis, three categories: Conservative/Right-Leaning Policy, Progressive/Left-Leaning Policy, and Neutral.</p>

<p><strong>Limitations.</strong> There’s no ground truth here — no human-coded comparison set for municipal ordinances. The classifications reflect what the model <em>thinks</em> these ordinances are about, not an objective standard. Model bias is a real concern, especially for the political lean analysis; LLMs have known tendencies in how they interpret political language. The sample sizes (200 for the Qwen runs, 1,000 for the GPT-4o robustness check) are reasonable but not exhaustive. The full classified datasets (1,700 SD ordinances, 3,900 SF ordinances) are now public on HuggingFace for anyone who wants to validate or extend this analysis.</p>

<hr />

<h2 id="the-public-datasets">The Public Datasets</h2>

<p>All data and classification results are publicly available:</p>

<ul>
  <li><strong>San Diego ordinances</strong>: <a href="https://huggingface.co/datasets/chrissoria/san-diego-ordinances">chrissoria/san-diego-ordinances</a> (87,983 records)</li>
  <li><strong>San Francisco ordinances</strong>: <a href="https://huggingface.co/datasets/chrissoria/sf-ordinances">chrissoria/sf-ordinances</a> (4,048 records)</li>
</ul>

<p>The source registry includes 16 datasets across California cities and counties, federal legislation, and social media:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span>

<span class="n">catllm</span><span class="p">.</span><span class="n">list_policy_sources</span><span class="p">()</span>  <span class="c1"># see all 16 sources
</span></code></pre></div></div>

<hr />

<h2 id="how-to-run-it-yourself">How to Run It Yourself</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cat-llm          <span class="c"># full ecosystem</span>
<span class="c"># or: pip install cat-pol    # just the political text package</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span>

<span class="c1"># Discover categories from your data
</span><span class="n">categories</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">extract_policy</span><span class="p">(</span>
    <span class="n">source</span><span class="o">=</span><span class="s">"city_san_diego"</span><span class="p">,</span>
    <span class="n">doc_type</span><span class="o">=</span><span class="s">"ordinance"</span><span class="p">,</span>
    <span class="n">n</span><span class="o">=</span><span class="mi">50</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"Qwen/Qwen2.5-72B-Instruct"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">max_categories</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span>
    <span class="n">specificity</span><span class="o">=</span><span class="s">"specific"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Classify using discovered categories
</span><span class="n">results</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">classify_policy</span><span class="p">(</span>
    <span class="n">source</span><span class="o">=</span><span class="s">"city_san_diego"</span><span class="p">,</span>
    <span class="n">categories</span><span class="o">=</span><span class="n">categories</span><span class="p">[</span><span class="s">"top_categories"</span><span class="p">],</span>
    <span class="n">doc_type</span><span class="o">=</span><span class="s">"ordinance"</span><span class="p">,</span>
    <span class="n">n</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"Qwen/Qwen2.5-72B-Instruct"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Summarize in plain language
</span><span class="n">summaries</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">summarize_policy</span><span class="p">(</span>
    <span class="n">source</span><span class="o">=</span><span class="s">"city_san_diego"</span><span class="p">,</span>
    <span class="n">n</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span>
    <span class="nb">format</span><span class="o">=</span><span class="s">"bullets"</span><span class="p">,</span>
    <span class="n">tone</span><span class="o">=</span><span class="s">"eli5"</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-key"</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Optimize prompts with user feedback
</span><span class="n">result</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">prompt_tune_policy</span><span class="p">(</span>
    <span class="n">source</span><span class="o">=</span><span class="s">"city_san_diego"</span><span class="p">,</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Housing"</span><span class="p">,</span> <span class="s">"Public Safety"</span><span class="p">,</span> <span class="s">"Finance"</span><span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-key"</span><span class="p">,</span>
    <span class="n">sample_size</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
<span class="p">)</span>

<span class="c1"># Fetch raw data (no API key needed)
</span><span class="n">df</span> <span class="o">=</span> <span class="n">catllm</span><span class="p">.</span><span class="n">fetch_policy_source</span><span class="p">(</span><span class="s">"city_san_diego"</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">doc_type</span><span class="o">=</span><span class="s">"ordinance"</span><span class="p">)</span>
</code></pre></div></div>

<p>cat-llm handles the data pull, classification, and output in a consistent pipeline. Every dataset in the registry uses the same <code class="language-plaintext highlighter-rouge">source=</code> parameter across all six functions (<code class="language-plaintext highlighter-rouge">classify_policy</code>, <code class="language-plaintext highlighter-rouge">extract_policy</code>, <code class="language-plaintext highlighter-rouge">explore_policy</code>, <code class="language-plaintext highlighter-rouge">summarize_policy</code>, <code class="language-plaintext highlighter-rouge">prompt_tune_policy</code>, <code class="language-plaintext highlighter-rouge">fetch_policy_source</code>).</p>

<hr />

<h2 id="what-this-tells-us">What This Tells Us</h2>

<p>The consistent finding across every analysis (different models, different sample sizes, different category schemes) is that San Diego and San Francisco govern in fundamentally different modes.</p>

<p>San Diego governs through <strong>administration</strong>: infrastructure projects, contract management, budget operations. Its ordinances are largely neutral, procedural, and focused on keeping the physical city running. When SD does legislate on policy, it tilts modestly progressive, but most of its legislative energy goes to the apolitical work of urban maintenance.</p>

<p>San Francisco governs through <strong>policy</strong>: zoning, business regulation, health services, housing, environmental mandates. Its ordinances are more likely to carry an ideological valence, both progressive <em>and</em> conservative, because SF actively codifies political values into law. It produces twice as many ordinances per month, and more than half of them take a policy stance.</p>

<p>Neither mode is inherently better. SD’s approach keeps government lean and focused; SF’s approach uses legislation as a tool for social intervention. But the data makes clear that these aren’t just different political leanings. They’re different <em>theories of what city government is for</em>.</p>

<h3 id="whats-next">What’s Next</h3>

<p>This analysis covers two cities. cat-pol ships with 16 data sources (and growing): 12 California cities, San Diego County, federal public laws, executive orders, presidential speeches, and Trump’s Truth Social posts. The same classification pipeline can be applied to any of them. Some questions I haven’t answered:</p>

<ul>
  <li><strong>Time series</strong>: Has SF always been this regulatory, or did it shift after a particular election?</li>
  <li><strong>City-size effects</strong>: Do smaller cities (Salinas, Clovis) look more like SD or SF?</li>
  <li><strong>County vs city</strong>: SD County just went live as a dataset — does county governance look different from city governance in the same jurisdiction?</li>
  <li><strong>Federal comparison</strong>: How do municipal policy domains map onto federal legislation?</li>
</ul>

<p>If you build something with cat-pol or the datasets, reach out at <a href="mailto:chrissoria@berkeley.edu">chrissoria@berkeley.edu</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="LLM" /><category term="political science" /><category term="municipal policy" /><category term="cat-pol" /><category term="NLP" /><category term="open source" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Upcoming Presentations and Conferences (Summer–Fall 2026)</title><link href="https://christophersoria.com/posts/2026/03/upcoming-presentations-2026/" rel="alternate" type="text/html" title="Upcoming Presentations and Conferences (Summer–Fall 2026)" /><published>2026-03-13T00:00:00-07:00</published><updated>2026-03-13T00:00:00-07:00</updated><id>https://christophersoria.com/posts/2026/03/upcoming-presentations-2026</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/upcoming-presentations-2026/"><![CDATA[<p>I have a busy conference season ahead. Here is where I’ll be presenting and what I’ll be talking about.</p>

<p><strong>PAA 2026</strong> — St. Louis, May 9</p>

<p><strong>AAPOR 2026</strong> — Los Angeles, May 13. Oral presentation on the <a href="/catllm/">CatLLM</a> pipeline for LLM-based survey response classification.</p>

<p><strong>EdDem Workshop</strong> — UW-Madison, May 14–15</p>

<p><strong>Sunbelt 2026</strong> — Daytona Beach, June 22–28. Oral presentation on social isolation, loneliness, and cognitive decline.</p>

<p><strong>ASA 2026</strong> — New York, August 7–11 <em>(pending)</em></p>

<p><strong>ΨMCA 2026</strong> — August 9–14. Presentation on algorithmic classification in dementia research.</p>

<p><strong>USC Gateway Brownbag</strong> — Fall semester</p>

<p><strong>GSA 2026</strong> — National Harbor, MD, November 4–7 <em>(pending)</em></p>

<p>The presentations span all three of my research streams: networks and cognitive aging (Sunbelt), computational methods for survey research (AAPOR), and applying algorithmic classification to dementia research (ΨMCA). If you’ll be at any of these, I’d love to connect.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="Social Networks" /><category term="Cognitive Aging" /><category term="Large Language Models" /><category term="Dementia" /><category term="Conferences" /><summary type="html"><![CDATA[I have a busy conference season ahead. Here is where I’ll be presenting and what I’ll be talking about.]]></summary></entry><entry><title type="html">What Bluesky’s Most-Followed Accounts Actually Post About</title><link href="https://christophersoria.com/posts/2026/03/catvader-bluesky-analysis/" rel="alternate" type="text/html" title="What Bluesky’s Most-Followed Accounts Actually Post About" /><published>2026-03-05T00:00:00-08:00</published><updated>2026-03-05T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/03/catvader-bluesky-analysis</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/catvader-bluesky-analysis/"><![CDATA[<p><img src="/images/catvader-bluesky-banner.png" alt="" /></p>

<audio controls="" style="width:100%">
  <source src="https://huggingface.co/datasets/chrissoria/blog-audio/resolve/main/catvader-bluesky-analysis.mp3" type="audio/mpeg" />
</audio>
<!-- Audio generated with edge_tts (en-US-BrianNeural) via convert_blog_to_audio.py in the repo root.
     To regenerate: cd chrissoria.github.io && python3 convert_blog_to_audio.py
     Hosted on HuggingFace: https://huggingface.co/datasets/chrissoria/blog-audio -->

<p>A few days ago I used <strong><a href="https://pypi.org/project/cat-vader/">cat-vader</a></strong> to analyze my own Threads feed — 582 posts, 9 categories, and a slightly uncomfortable amount of self-reflection. That analysis asked a personal question: what do <em>I</em> post about?</p>

<p>This one asks a broader question. I took the same classification pipeline and pointed it at ten of Bluesky’s most-followed accounts: AOC, Mark Cuban, Mark Hamill, The Onion, George Takei, The New York Times, Rachel Maddow, Stephen King, MeidasTouch, and NPR. These accounts span the ecosystem — a sitting congresswoman, a billionaire, a pop culture icon, a satirical news outlet, two major media organizations, an advocacy network, and a horror novelist. 250 posts per account, 2,500 posts total, all classified by GPT-4o-mini against a nine-category scheme.</p>

<p>What do Bluesky’s most-followed accounts post about? And more interestingly: once you control for <em>who</em> is posting, does content type actually predict engagement?</p>

<hr />

<h2 id="background-cat-vader-and-the-dataset">Background: cat-vader and the Dataset</h2>

<p><strong><a href="https://github.com/chrissoria/cat-vader">cat-vader</a></strong> is a fork of my open-source survey classification package <strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong>, adapted for social media analysis. The core idea is simple: you give it a list of posts and a set of categories with descriptions, and it uses an LLM to check each post against each category independently. Because categories aren’t mutually exclusive, a single post can belong to multiple categories simultaneously — a Rachel Maddow post about deportation policy might be tagged as both Politics &amp; Elections and Social Issues &amp; Justice. That multi-label design is what makes the output useful for analysis rather than just binning.</p>

<p>For this analysis I pulled 250 posts from each of ten accounts using cat-vader’s <code class="language-plaintext highlighter-rouge">sm_source="bluesky"</code> integration, classified them all using GPT-4o-mini, and exported everything to a single CSV that’s now publicly available on Hugging Face at <strong><a href="https://huggingface.co/datasets/chrissoria/bluesky-top10-classified">chrissoria/bluesky-top10-classified</a></strong>.</p>

<p>One important note before diving in: <strong>Bluesky’s API returns <code class="language-plaintext highlighter-rouge">views = 0</code> for all posts.</strong> The platform either doesn’t track impression counts or doesn’t expose them. That means the engagement analysis here focuses entirely on <strong>likes and replies</strong> — two signals that are real and public, but not the same as reach.</p>

<hr />

<h2 id="category-setup">Category Setup</h2>

<p>Before running classification, I defined nine categories designed to capture the thematic range of these accounts. Rather than bare labels, each category gets a description that guides the model on borderline cases.</p>

<p><strong>1. Politics &amp; Elections</strong> — Posts about electoral dynamics, political parties, voting, candidates, or partisan maneuvering. Includes commentary on legislative proceedings, election results, and the behavior of political figures as political actors.</p>

<blockquote>
  <p><em>“I honestly believe our most powerful position in a toxic time that feeds on cynicism, apathy, &amp; despair is to genuinely care and act for a better world.”</em> — Alexandria Ocasio-Cortez (146,230 likes)</p>
</blockquote>

<p><strong>2. Trump &amp; MAGA Criticism</strong> — Posts directly targeting Donald Trump, his administration, his supporters, or the MAGA movement. Includes both policy critiques and character commentary.</p>

<blockquote>
  <p><em>“Trump Suffers Setback Unrelated To Child Rape”</em> — The Onion (10,746 likes)</p>
</blockquote>

<p><strong>3. Social Issues &amp; Justice</strong> — Posts about systemic inequality, civil rights, immigration enforcement, discrimination, or other social conditions. Focus is observational or normative rather than electoral.</p>

<blockquote>
  <p><em>“The owners of a Dallas County warehouse that ICE had planned to use as a mega detention center said Monday it will not sell or lease the property to the federal government. ‘God answered our prayers,’ the Hutchins Mayor said.”</em> — Rachel Maddow (21,132 likes)</p>
</blockquote>

<p><strong>4. News &amp; Current Events</strong> — Posts reporting on, linking to, or discussing recent news stories across any domain. Includes breaking news, investigative stories, and news aggregation.</p>

<blockquote>
  <p><em>“An NPR investigation finds the public database of Epstein files is missing dozens of pages related to sexual abuse accusations against President Trump.”</em> — NPR (10,185 likes)</p>
</blockquote>

<p><strong>5. Entertainment &amp; Pop Culture</strong> — Posts about film, television, music, celebrity, sports, books, or cultural moments. Includes personal fandom and cultural commentary.</p>

<blockquote>
  <p><em>“Bluesky is collegial and interesting, the way Twitter used to be. Bonus: most people can spell.”</em> — Stephen King (80,664 likes)</p>
</blockquote>

<p><strong>6. Humor &amp; Satire</strong> — Posts that are primarily comedic in intent: jokes, satirical takes, absurdist commentary, or ironic framings of current events.</p>

<blockquote>
  <p><em>“Netanyahu Calls Iran Strikes Necessary To Prevent War He Just Started”</em> — The Onion (23,571 likes)</p>
</blockquote>

<p><strong>7. Science &amp; Technology</strong> — Posts about scientific findings, technology developments, AI, climate science, medicine, or the intersection of tech and society.</p>

<p><strong>8. Economy &amp; Business</strong> — Posts about financial markets, economic conditions, corporate news, consumer prices, trade policy, or business developments.</p>

<blockquote>
  <p><em>“Mr. Cuban — I just wanted to quickly thank you. My husband has [cancer]. We went to pick up his medication and were informed it was $29,000. We were able to get it from CostPlus for $99.”</em> — Mark Cuban (12,276 likes)</p>
</blockquote>

<p><strong>9. Personal &amp; Lifestyle</strong> — Posts that are personal in nature: life updates, reflections, expressions of mood, personal milestones, or non-political opinion.</p>

<blockquote>
  <p><em>“We were married on this day in 1978… soulmates ever since.”</em> — Mark Hamill (53,889 likes)</p>
</blockquote>

<hr />

<h2 id="what-blueskys-top-accounts-post-about">What Bluesky’s Top Accounts Post About</h2>

<p><img src="/images/bluesky-category-distribution.png" alt="" /></p>

<p>The overall landscape has a clear hierarchy. <strong>News &amp; Current Events</strong> dominates at 58.7% of all 2,500 posts — nearly three in five posts link to or discuss a recent story. <strong>Politics &amp; Elections</strong> comes in second at 50.0%, meaning half of all posts across these accounts touch on politics in some form. Then there’s a significant drop to <strong>Social Issues &amp; Justice</strong> (26.3%), <strong>Entertainment &amp; Pop Culture</strong> (23.0%), and <strong>Humor &amp; Satire</strong> (18.0%). <strong>Trump &amp; MAGA Criticism</strong> sits at 16.2%. At the bottom: <strong>Economy &amp; Business</strong> (13.2%), <strong>Personal &amp; Lifestyle</strong> (8.1%), and <strong>Science &amp; Technology</strong> (5.8%).</p>

<p>The concentration at the top isn’t surprising for this particular set of accounts — these aren’t lifestyle influencers or tech reviewers. But the degree of dominance by News and Politics is still striking. More than half of what this slice of Bluesky produces is essentially political journalism or political commentary.</p>

<p><img src="/images/bluesky-category-by-account.png" alt="" /></p>

<p>The account-level breakdown is where things get interesting.</p>

<p><strong>NPR</strong> is the most single-mindedly focused: 93% of its posts fall under News &amp; Current Events, and 60% under Politics &amp; Elections. Almost nothing else. The Humor &amp; Satire bar is essentially invisible (1%). NPR’s Bluesky presence is a straight news wire.</p>

<p><strong>The Onion</strong> is the mirror image in the best way. 81% of Onion posts are tagged Humor &amp; Satire — the highest satire rate of any account — but also 60% Entertainment &amp; Pop Culture. What’s slightly surprising is that only 21% of The Onion’s posts are tagged Politics &amp; Elections directly, and only 9% Trump &amp; MAGA Criticism. Satirical headlines about Trump do get tagged politics when the framing is explicitly electoral, but a lot of Onion content uses political <em>subjects</em> in the service of pure absurdism, which the model correctly separates out.</p>

<p><strong>AOC</strong> leads in Politics &amp; Elections (68% of her posts), Social Issues &amp; Justice (36%), and News &amp; Current Events (55%). Almost no Humor, very little Economy. Her Bluesky presence reads like exactly what it is: a member of Congress doing political communication full-time.</p>

<p><strong>Mark Cuban</strong> is the biggest outlier in the dataset. He’s the only account where Economy &amp; Business is the dominant theme at 58% — more than any other category. He’s also one of the few accounts where Trump &amp; MAGA Criticism is near zero (under 1%). Cuban posts about healthcare costs, business models, tariff economics, and policy mechanics. His account has almost nothing in common with the others topically.</p>

<p><strong>Stephen King</strong> is the most balanced. He mixes Entertainment &amp; Pop Culture (62%), Politics &amp; Elections (27%), Humor &amp; Satire (23%), and News &amp; Current Events (24%) roughly evenly. His account feels like a person who actually posts about many things, not a media organization running a content strategy.</p>

<p><strong>Mark Hamill</strong> tilts heavily toward Entertainment &amp; Pop Culture (Star Wars, his marriage, personal reflections) alongside Politics — a mix that reflects his public persona as both a cultural figure and a vocal political voice.</p>

<hr />

<h2 id="who-gets-the-most-engagement">Who Gets the Most Engagement?</h2>

<p><img src="/images/bluesky-likes-by-account.png" alt="" /></p>

<p>There is not a close race here. <strong>AOC averages 34,674 likes per post</strong> — more than twice the second-place Mark Hamill at 12,049. After that: Stephen King (5,039), MeidasTouch (4,226), Rachel Maddow (3,744), The Onion (1,598), George Takei (1,016), NPR (303), Mark Cuban (139), and the NYT (91).</p>

<p>The top 10 most-liked posts in the dataset are all AOC’s, with the most-liked reaching 167,305 likes on a post about a protest in Tucson: <em>“Original projected attendance was 3,000 people. 23,000 showed up.”</em></p>

<p>The NPR and NYT numbers are particularly striking — both are major media organizations with massive follower counts, yet they average roughly 300 and under 100 likes per post respectively. The media organizations are generating engagement on Bluesky at a fraction of the rate of individual personalities. Whether that reflects the algorithm, audience behavior, content style, or all three is hard to disentangle from this data alone.</p>

<p><img src="/images/bluesky-engagement-by-category.png" alt="" /></p>

<p>By category, the engagement hierarchy tells a clean story. <strong>Social Issues &amp; Justice</strong> leads at 8,819 average likes per post, followed by <strong>Politics &amp; Elections</strong> at 7,795 and <strong>Trump &amp; MAGA Criticism</strong> at 6,849. <strong>Economy &amp; Business</strong> posts average just 2,790 likes; <strong>Science &amp; Technology</strong> posts average only 706 — by far the lowest of any category.</p>

<p>But engagement averages are heavily influenced by <em>who</em> posts in each category. AOC posts almost entirely in Politics and Social Issues; Mark Cuban posts almost entirely in Economy. These raw category averages are partly measuring account popularity, not content appeal in isolation.</p>

<hr />

<h2 id="what-predicts-likes">What Predicts Likes?</h2>

<p>To separate content from creator, I ran two regression models with <code class="language-plaintext highlighter-rouge">log(likes + 1)</code> as the outcome — a log transformation that handles the extreme right skew in likes (the median is 919, the max is 167,305).</p>

<p>The first model includes only the eight category indicators, with <strong>Personal &amp; Lifestyle</strong> as the omitted reference category. The second adds account fixed effects, which isolates the content effect from the very large differences in per-account audience.</p>
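
<p>For reference, both specifications are ordinary least squares on the log outcome. A sketch with statsmodels, using the column names from the published CSV (<code class="language-plaintext highlighter-rouge">category_1</code> through <code class="language-plaintext highlighter-rouge">category_9</code>, <code class="language-plaintext highlighter-rouge">likes</code>, <code class="language-plaintext highlighter-rouge">account_name</code>); it assumes the category columns are 0/1 and that <code class="language-plaintext highlighter-rouge">category_9</code> is the Personal &amp; Lifestyle indicator, and it will not necessarily reproduce the exact coefficients below.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bluesky_classified.csv")
df["log_likes"] = np.log1p(df["likes"])

# Model 1: content categories only. category_1 ... category_8 are the binary indicators;
# category_9 (assumed to be Personal &amp; Lifestyle) is left out as the reference.
cats = " + ".join(f"category_{i}" for i in range(1, 9))
m1 = smf.ols(f"log_likes ~ {cats}", data=df).fit()

# Model 2: add account fixed effects to separate content effects from creator effects
m2 = smf.ols(f"log_likes ~ {cats} + C(account_name)", data=df).fit()

print(m1.rsquared, m2.rsquared)
</code></pre></div></div>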

<p><img src="/images/bluesky-regression-likes.png" alt="" /></p>

<p>Before controlling for account, seven of eight categories are statistically significant predictors of likes relative to Personal &amp; Lifestyle posts. The strongest positive predictors:</p>

<ul>
  <li><strong>Entertainment &amp; Pop Culture</strong>: +1.04 log-units (the largest positive coefficient)</li>
  <li><strong>Politics &amp; Elections</strong>: +1.01</li>
  <li><strong>Trump &amp; MAGA Criticism</strong>: +0.99</li>
  <li><strong>Humor &amp; Satire</strong>: +0.77</li>
  <li><strong>Social Issues &amp; Justice</strong>: +0.60</li>
  <li><strong>News &amp; Current Events</strong>: +0.29</li>
</ul>

<p>The two negative predictors are striking: <strong>Science &amp; Technology</strong> (−0.47) and especially <strong>Economy &amp; Business</strong> (−1.96), the largest coefficient in the model by magnitude and strongly negative. An economics post generates dramatically fewer likes than a comparable personal post in this raw model. The model explains 18.1% of the variance in log-likes.</p>

<p>But much of this is just reflecting AOC’s enormous engagement. AOC rarely posts about the Economy; Cuban almost exclusively does, and Cuban averages 139 likes per post. Once we account for who’s posting, the picture shifts considerably.</p>

<p><img src="/images/bluesky-regression-likes-adj.png" alt="" /></p>

<p>With account fixed effects added, R² jumps from 18.1% to 61.9% — the vast majority of the variance in likes is explained by <em>who</em> posted, not <em>what</em> they posted about. But the content effects don’t disappear; they just reorganize.</p>

<p><strong>Politics &amp; Elections</strong> (0.74), <strong>News &amp; Current Events</strong> (0.65), <strong>Social Issues &amp; Justice</strong> (0.57), <strong>Entertainment &amp; Pop Culture</strong> (0.39), <strong>Humor &amp; Satire</strong> (0.38), and <strong>Trump &amp; MAGA Criticism</strong> (0.30) all remain positive and significant even after controlling for account. Within any given account, these content types generate more likes than a baseline personal post.</p>

<p>Most striking: <strong>Economy &amp; Business</strong> goes from −1.96 to −0.002 (effectively zero, p = 0.99). The raw negative effect was entirely due to compositional differences between accounts — Cuban is the economy poster, and Cuban gets very few likes. Once you control for the account posting it, economics content performs no differently than personal content. The category itself isn’t the problem; the platform just happens to have its economics-focused account underperform on likes.</p>

<p><strong>Science &amp; Technology</strong> also loses its negative coefficient entirely once account is controlled (it becomes +0.25, non-significant). Science content doesn’t hurt engagement; it just happens to be posted by accounts that underperform on likes overall.</p>

<hr />

<h2 id="does-timing-matter">Does Timing Matter?</h2>

<p><img src="/images/bluesky-likes-by-weekday.png" alt="" /></p>

<p>The weekly pattern is clear. <strong>Friday is the best day to post</strong> at 8,522 average likes, followed by Tuesday (6,648), Wednesday (6,263), and Saturday (5,384). <strong>Sunday is the worst day</strong> at 3,325 average likes — about 60% lower than Friday’s average.</p>
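
<p>The weekday averages are a plain groupby on the post timestamp; a minimal sketch, assuming <code class="language-plaintext highlighter-rouge">timestamp</code> parses cleanly with pandas:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

df = pd.read_csv("bluesky_classified.csv")
df["weekday"] = pd.to_datetime(df["timestamp"]).dt.day_name()

# Average likes per post by day of week
print(df.groupby("weekday")["likes"].mean().sort_values(ascending=False).round(0))
</code></pre></div></div>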

<p>The spread across days is meaningful but not enormous compared to the account-level differences. Moving from Sunday to Friday would roughly double your average likes in this dataset. Moving from the NYT to AOC would multiply them by about 380.</p>

<p><img src="/images/bluesky-engagement-scatter.png" alt="" /></p>

<p>The scatter plot shows every original post (log scale) colored by account. The AOC cluster sits visibly higher and to the right than every other account — more likes <em>and</em> more replies. The media organizations (NYT, NPR) cluster near the bottom. Most posts across all accounts sit in the low-engagement zone regardless of content type; the viral outliers are rare and concentrated in a small number of accounts.</p>

<hr />

<h2 id="the-public-dataset">The Public Dataset</h2>

<p>The full classified dataset — 2,500 posts, 10 accounts, 9 binary category columns, engagement metrics — is available at <strong><a href="https://huggingface.co/datasets/chrissoria/bluesky-top10-classified">chrissoria/bluesky-top10-classified</a></strong> on Hugging Face.</p>

<p>Each row is a post with columns for: <code class="language-plaintext highlighter-rouge">account_name</code>, <code class="language-plaintext highlighter-rouge">account_handle</code>, <code class="language-plaintext highlighter-rouge">timestamp</code>, <code class="language-plaintext highlighter-rouge">social_media_input</code> (the post text), <code class="language-plaintext highlighter-rouge">likes</code>, <code class="language-plaintext highlighter-rouge">replies</code>, <code class="language-plaintext highlighter-rouge">reposts</code>, <code class="language-plaintext highlighter-rouge">is_repost</code>, <code class="language-plaintext highlighter-rouge">category_1</code> through <code class="language-plaintext highlighter-rouge">category_9</code>, and several derived fields including <code class="language-plaintext highlighter-rouge">post_length</code>, <code class="language-plaintext highlighter-rouge">contains_url</code>, and <code class="language-plaintext highlighter-rouge">contains_image</code>. All classification was done by GPT-4o-mini in a single-pass, no-ensemble run. For research applications requiring higher accuracy, a multi-model ensemble is straightforward to add.</p>
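
<p>Loading the published data is a one-liner with the <code class="language-plaintext highlighter-rouge">datasets</code> library, and the per-account rates above fall out of a groupby. (The <code class="language-plaintext highlighter-rouge">train</code> split name is the usual default for a single-file dataset, and the sketch assumes the category columns are 0/1.)</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from datasets import load_dataset

# Pull the classified posts straight from HuggingFace
df = load_dataset("chrissoria/bluesky-top10-classified", split="train").to_pandas()

# Per-account share of posts tagged Politics &amp; Elections (category_1 in the scheme above)
print(df.groupby("account_name")["category_1"].mean().sort_values(ascending=False).round(2))
</code></pre></div></div>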

<p>A few things you could do with this dataset that I haven’t:</p>
<ul>
  <li><strong>Sentiment analysis</strong> by account — which accounts skew optimistic vs. cynical?</li>
  <li><strong>Topic evolution over time</strong> — has the distribution shifted as the news cycle changed?</li>
  <li><strong>Reply rate vs. like rate</strong> — some content generates conversation without generating likes; which category does that?</li>
  <li><strong>Cross-platform comparison</strong> — the same 10 accounts presumably post on other platforms; how does Bluesky compare to X or Instagram in terms of what they share?</li>
</ul>

<hr />

<h2 id="how-to-run-it-yourself">How to Run It Yourself</h2>

<p>cat-vader handles the data pull and classification in a single call. Here’s the full workflow for replicating this analysis on any Bluesky account:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catvader</span> <span class="k">as</span> <span class="n">cv</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">cv</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">sm_source</span><span class="o">=</span><span class="s">"bluesky"</span><span class="p">,</span>
    <span class="n">handle</span><span class="o">=</span><span class="s">"aoc.bsky.social"</span><span class="p">,</span>   <span class="c1"># any Bluesky handle
</span>    <span class="n">sm_posts</span><span class="o">=</span><span class="mi">250</span><span class="p">,</span>                <span class="c1"># number of recent posts to fetch
</span>    <span class="n">categories</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Politics &amp; Elections"</span><span class="p">,</span>
        <span class="s">"Trump &amp; MAGA Criticism"</span><span class="p">,</span>
        <span class="s">"Social Issues &amp; Justice"</span><span class="p">,</span>
        <span class="s">"News &amp; Current Events"</span><span class="p">,</span>
        <span class="s">"Entertainment &amp; Pop Culture"</span><span class="p">,</span>
        <span class="s">"Humor &amp; Satire"</span><span class="p">,</span>
        <span class="s">"Science &amp; Technology"</span><span class="p">,</span>
        <span class="s">"Economy &amp; Business"</span><span class="p">,</span>
        <span class="s">"Personal &amp; Lifestyle"</span><span class="p">,</span>
    <span class="p">],</span>
    <span class="n">description</span><span class="o">=</span><span class="s">"Social media posts from a public Bluesky account"</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-openai-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"gpt-4o-mini"</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">results</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">"bluesky_classified.csv"</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
</code></pre></div></div>

<p>The output is a DataFrame with one row per post and binary columns for each category. From there, any standard analysis pipeline applies — R, Python, whatever you prefer.</p>

<p>One difference from the Threads workflow: Bluesky doesn’t require OAuth or a developer account. The public API is unauthenticated for reading public posts. You just need an OpenAI (or other provider) key for the classification step.</p>

<hr />

<p>The consistent finding across both the Threads analysis and this one is that <em>who</em> is posting matters far more than <em>what</em> they post. Account identity — follower count, posting frequency, platform reputation — explains more of the variance in engagement than any content category. But content is not irrelevant: within any given account, political and social posts consistently outperform baseline personal posts, and that effect survives account controls.</p>

<p>For Bluesky specifically, the platform’s current character as a haven for politically engaged progressives shows up plainly in the data. Even The Onion — a satirical outlet that theoretically covers everything — lands 81% of its posts in Humor &amp; Satire while still earning engagement that tracks the political intensity of the moment. The platform has a distinct topical gravity, and the accounts doing best are the ones whose content aligns with it.</p>

<p>If you build something interesting with the dataset or the cat-vader pipeline, reach out at <a href="mailto:chrissoria@berkeley.edu">chrissoria@berkeley.edu</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="LLM" /><category term="social media" /><category term="bluesky" /><category term="cat-vader" /><category term="NLP" /><category term="open source" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Classifying Open-Ended Survey Responses Directly in Claude Code</title><link href="https://christophersoria.com/posts/2026/03/catllm-claude-code/" rel="alternate" type="text/html" title="Classifying Open-Ended Survey Responses Directly in Claude Code" /><published>2026-03-04T00:00:00-08:00</published><updated>2026-03-04T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/03/catllm-claude-code</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/catllm-claude-code/"><![CDATA[<p><img src="/images/catllm-claude-code-banner.png" alt="" /></p>

<p>My open-source package <strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong> has always been a Python-first tool: you install it, import it, write a script, run it. That works fine when you already have a pipeline set up. It’s friction when you just want to quickly classify a CSV someone sent you and move on.</p>

<p>Claude Code — Anthropic’s terminal-based coding agent — supports project-local slash commands: markdown files in <code class="language-plaintext highlighter-rouge">.claude/commands/</code> that inject a prompt and tool permissions when invoked. I added four of them to cat-llm. The result is that you can now classify survey data, extract categories, check your API keys, and run end-to-end tests without touching a Python file.</p>

<p>This post walks through the setup and shows what it looks like in practice using 40 rows of open-ended responses from the UCNets survey (variable <code class="language-plaintext highlighter-rouge">a19i</code>).</p>

<hr />

<h2 id="installation">Installation</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cat-llm
</code></pre></div></div>

<p>That’s the only dependency. cat-llm pulls in pandas, tqdm, requests, openai, and anthropic automatically. For PDF classification support:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cat-llm[pdf]
</code></pre></div></div>

<p>You’ll also need at least one provider API key. cat-llm reads from environment variables automatically:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">export </span><span class="nv">OPENAI_API_KEY</span><span class="o">=</span><span class="s2">"sk-..."</span>       <span class="c"># OpenAI / xAI</span>
<span class="nb">export </span><span class="nv">ANTHROPIC_API_KEY</span><span class="o">=</span><span class="s2">"sk-ant-..."</span>  <span class="c"># Anthropic</span>
<span class="nb">export </span><span class="nv">GOOGLE_API_KEY</span><span class="o">=</span><span class="s2">"AIza..."</span>      <span class="c"># Google</span>
</code></pre></div></div>

<p>Or drop them in a <code class="language-plaintext highlighter-rouge">.env</code> file at your project root — cat-llm will pick them up.</p>

<hr />

<h2 id="adding-the-claude-code-commands">Adding the Claude Code Commands</h2>

<p>Clone or navigate to your cat-llm working directory, then create the commands folder:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> <span class="nt">-p</span> .claude/commands/catllm
</code></pre></div></div>

<p>Add four markdown files — <code class="language-plaintext highlighter-rouge">classify.md</code>, <code class="language-plaintext highlighter-rouge">extract.md</code>, <code class="language-plaintext highlighter-rouge">providers.md</code>, and <code class="language-plaintext highlighter-rouge">test.md</code> — each containing a prompt that describes what Claude should do when the command is invoked. The full command definitions are in the <a href="https://github.com/chrissoria/cat-llm">cat-llm repository</a> under <code class="language-plaintext highlighter-rouge">.claude/commands/catllm/</code>.</p>

<p>Once the files exist, open (or reopen) a Claude Code session from the cat-llm project directory. Type <code class="language-plaintext highlighter-rouge">/catllm:</code> and tab-complete to see all four commands available.</p>

<hr />

<h2 id="checking-your-providers">Checking Your Providers</h2>

<p>Before classifying anything, run:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/catllm:providers
</code></pre></div></div>

<p>Claude detects which API keys are present in your environment, masks the values, and lists suggested model names for each configured provider:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== cat-llm Provider Status ===

[OK] OpenAI / xAI
     Key: OPENAI_API_KEY = sk-p...k3Rw
     Models: gpt-5, gpt-4o-mini, grok-3

[OK] Anthropic
     Key: ANTHROPIC_API_KEY = sk-a...9xQz
     Models: claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5-20251001

[ ] Google (GOOGLE_API_KEY not set)
[ ] Mistral (MISTRAL_API_KEY not set)
[ ] HuggingFace (HUGGINGFACE_API_TOKEN not set)

[OK] Ollama (local)
     llama3.2:latest
     mistral:latest

Configured: 2 provider(s)
</code></pre></div></div>

<hr />

<h2 id="classifying-ucnets-a19i">Classifying UCNets <code class="language-plaintext highlighter-rouge">a19i</code></h2>

<p>The UCNets survey includes an open-ended question (<code class="language-plaintext highlighter-rouge">a19i</code>) asking respondents about sources of social support. The raw responses are text — exactly what cat-llm is designed for.</p>

<p>I have 40 responses in a CSV. The first step is letting the data suggest its own categories rather than imposing a scheme from the outside. The <code class="language-plaintext highlighter-rouge">/catllm:extract</code> command runs cat-llm’s <code class="language-plaintext highlighter-rouge">extract()</code> function, which samples random chunks of the responses, asks the model to name the recurring themes in each chunk, and returns a ranked frequency list:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/catllm:extract ucnets_a19i.csv --col response --model gpt-5
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== Extracting categories from ucnets_a19i.csv ===
Column: response | Rows: 40 | Model: gpt-5

Discovered categories (ranked by frequency):
  1. Family and close relationships       (found in 18/20 chunks)
  2. Friends and peer support             (found in 15/20 chunks)
  3. Professional or institutional help   (found in 12/20 chunks)
  4. Online communities                   (found in 9/20 chunks)
  5. Religious or spiritual community     (found in 8/20 chunks)
  6. No support or social isolation       (found in 7/20 chunks)
  7. Coworkers or colleagues              (found in 5/20 chunks)

Suggested follow-up:
  /catllm:classify ucnets_a19i.csv \
    --col response \
    --categories "Family and close relationships,Friends and peer support,Professional or institutional help,Online communities,Religious or spiritual community,No support or social isolation,Coworkers or colleagues" \
    --output ucnets_a19i_classified.csv
</code></pre></div></div>

<p>The command even generates the follow-up classify call. Running it:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/catllm:classify ucnets_a19i.csv \
  --col response \
  --categories "Family and close relationships,Friends and peer support,Professional or institutional help,Online communities,Religious or spiritual community,No support or social isolation,Coworkers or colleagues" \
  --model gpt-5 \
  --output ucnets_a19i_classified.csv
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== Classifying ucnets_a19i.csv ===
Column: response | Rows: 40 | Model: gpt-5
Categories: 7

   response                                          family  friends  professional  online  religious  isolated  coworkers
0  My sister and I talk every day, she really...        1       0          0          0         0          0         0
1  Mostly just online — Reddit communities wh...        0       0          0          1         0          0         0
2  My church group has been incredible, they ...        0       1          0          0         1          0         0
3  Honestly nobody. I've felt very alone thr...         0       0          0          0         0          1         0
4  My therapist and my wife, in that order...          1       0          1          0         0          0         0
...

--- Category Distribution (40 rows) ---
Family and close relationships      29  (72.5%)
Friends and peer support            18  (45.0%)
Professional or institutional help  11  (27.5%)
Religious or spiritual community     9  (22.5%)
Online communities                   8  (20.0%)
No support or social isolation       6  (15.0%)
Coworkers or colleagues              4  (10.0%)

Saved to ucnets_a19i_classified.csv
</code></pre></div></div>

<p>The output is a binary matrix — one column per category, one row per response. A single response can belong to multiple categories simultaneously, which is the right design for this kind of data: someone who mentions both their sister and their therapist gets a 1 in both <code class="language-plaintext highlighter-rouge">family</code> and <code class="language-plaintext highlighter-rouge">professional</code>, not a forced choice between them.</p>

<p>The classified CSV is ready for any downstream analysis — R, Stata, Python, whatever the pipeline requires.</p>
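
<p>If you want to sanity-check the matrix before handing it off, a few lines of pandas are enough. A minimal sketch, assuming the saved CSV uses the short column names shown in the preview above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Load the classified output and inspect category prevalence and co-occurrence.
# Column names are assumed to match the short labels in the preview above.
coded = pd.read_csv("ucnets_a19i_classified.csv")
category_cols = ["family", "friends", "professional", "online",
                 "religious", "isolated", "coworkers"]

# Share of responses tagged with each category
print(coded[category_cols].mean().sort_values(ascending=False))

# How often pairs of categories co-occur within the same response
print(coded[category_cols].T.dot(coded[category_cols]))
</code></pre></div></div>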

<hr />

<h2 id="running-the-quick-test">Running the Quick Test</h2>

<p>There’s also a built-in smoke test command that runs on the package’s bundled example data:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/catllm:test
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=== cat-llm Quick Test ===
File: examples/test_data/survey_responses.csv
Model: gpt-5
Categories: ['Positive', 'Negative', 'Neutral']

Loaded 20 rows. Columns: ['id', 'response']
Text column: response

--- Results ---
    response                                          Positive  Negative  Neutral
0   The program was incredibly helpful and I...           1        0        0
1   I didn't find the sessions useful at all...           0        1        0
2   It was okay, nothing special but not bad...           0        0        1
...

--- Category Distribution ---
Positive    11
Negative     5
Neutral      4

PASS: classify() completed successfully.
</code></pre></div></div>

<p>Useful for verifying a new API key or model name works before pointing it at real data.</p>

<hr />

<h2 id="what-else-you-can-do">What Else You Can Do</h2>

<p>The commands are thin wrappers around cat-llm’s Python API, which has considerably more surface area:</p>

<p><strong>Multi-model ensemble.</strong> Pass a list of models and cat-llm runs all of them in parallel, then reports a consensus column alongside each model’s individual output. Disagreements across models are flagged automatically.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="n">result</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">input_data</span><span class="p">,</span>
    <span class="n">categories</span><span class="p">,</span>
    <span class="n">models</span><span class="o">=</span><span class="p">[</span>
        <span class="p">(</span><span class="s">"gpt-5"</span><span class="p">,</span>                <span class="s">"openai"</span><span class="p">,</span>    <span class="n">openai_key</span><span class="p">,</span>    <span class="p">{}),</span>
        <span class="p">(</span><span class="s">"claude-sonnet-4-6"</span><span class="p">,</span>    <span class="s">"anthropic"</span><span class="p">,</span> <span class="n">anthropic_key</span><span class="p">,</span> <span class="p">{}),</span>
        <span class="p">(</span><span class="s">"gemini-2.0-flash"</span><span class="p">,</span>     <span class="s">"google"</span><span class="p">,</span>    <span class="n">google_key</span><span class="p">,</span>    <span class="p">{}),</span>
    <span class="p">],</span>
    <span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"unanimous"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Extended reasoning.</strong> Pass <code class="language-plaintext highlighter-rouge">thinking_budget=4096</code> (Anthropic) or <code class="language-plaintext highlighter-rouge">chain_of_thought=True</code> for borderline classification tasks where you want the model to reason before committing to a label.</p>
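
<p>As a rough sketch of what that looks like in a direct call (the parameter placement is an assumption based on the description above; the model tuple format mirrors the ensemble example):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm as cat

# Sketch: extended reasoning on a borderline task. thinking_budget applies to
# Anthropic models; chain_of_thought=True is the provider-agnostic alternative.
result = cat.classify(
    input_data,
    categories,
    models=[("claude-sonnet-4-6", "anthropic", anthropic_key, {})],
    thinking_budget=4096,  # token budget the model can spend reasoning first
)
</code></pre></div></div>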

<p><strong>PDF classification.</strong> If <code class="language-plaintext highlighter-rouge">input_data</code> points at a directory of PDFs, cat-llm renders each page and classifies the content as images. Useful for document archives or grant applications.</p>

<p><strong>Category discovery.</strong> <code class="language-plaintext highlighter-rouge">cat.extract()</code> is the same function the <code class="language-plaintext highlighter-rouge">/catllm:extract</code> command calls under the hood. You can run it directly, tune the <code class="language-plaintext highlighter-rouge">iterations</code> and <code class="language-plaintext highlighter-rouge">divisions</code> parameters, and use the raw frequency table to build a codebook before any classification happens.</p>
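
<p>A sketch of calling it directly. The <code class="language-plaintext highlighter-rouge">iterations</code> and <code class="language-plaintext highlighter-rouge">divisions</code> parameters are the ones named above, while the rest of the call reuses conventions from the ensemble example and should be treated as illustrative rather than the exact signature:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catllm as cat

# Sketch: category discovery before classification. More iterations and more
# divisions mean more sampling passes over smaller chunks of the data.
themes = cat.extract(
    input_data,
    iterations=6,
    divisions=15,
    models=[("gpt-5", "openai", openai_key, {})],
)
print(themes)  # raw frequency table you can turn into a codebook
</code></pre></div></div>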

<hr />

<p>The cat-llm repository is at <strong><a href="https://github.com/chrissoria/cat-llm">github.com/chrissoria/cat-llm</a></strong> and the package is on PyPI as <code class="language-plaintext highlighter-rouge">cat-llm</code> (version 2.5.1 as of this writing). If you adapt the Claude Code commands for a different workflow or build something interesting with the package, reach out at <a href="mailto:chrissoria@berkeley.edu">chrissoria@berkeley.edu</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="LLM" /><category term="cat-llm" /><category term="Claude Code" /><category term="open source" /><category term="NLP" /><category term="survey research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Analyzing My Threads Feed with cat-vader: LLM-Powered Social Media Classification at Scale</title><link href="https://christophersoria.com/posts/2026/03/catvader-threads-analysis/" rel="alternate" type="text/html" title="Analyzing My Threads Feed with cat-vader: LLM-Powered Social Media Classification at Scale" /><published>2026-03-02T00:00:00-08:00</published><updated>2026-03-02T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/03/catvader-threads-analysis</id><content type="html" xml:base="https://christophersoria.com/posts/2026/03/catvader-threads-analysis/"><![CDATA[<p><img src="/images/catvader-banner.png" alt="" /></p>

<audio controls="" style="width:100%">
  <source src="https://huggingface.co/datasets/chrissoria/blog-audio/resolve/main/catvader-threads-analysis.mp3" type="audio/mpeg" />
</audio>
<!-- Audio generated with edge_tts (en-US-BrianNeural) via convert_blog_to_audio.py in the repo root.
     To regenerate: cd chrissoria.github.io && python3 convert_blog_to_audio.py
     Hosted on HuggingFace: https://huggingface.co/datasets/chrissoria/blog-audio -->

<p>I spend a lot of time on Threads. Over the past two and a half years I’ve posted nearly 900 times: opinions on politics, technology, research, culture, and whatever else caught my attention that day. But I’ve never sat down and actually looked at what I post about. What are my real preoccupations? What topics dominate my feed? Which posts actually get engagement?</p>

<p>This post is an attempt to answer those questions systematically, using an LLM-powered classification pipeline I built called <strong><a href="https://pypi.org/project/cat-vader/">cat-vader</a></strong> — a fork of my open-source survey classification package, <strong><a href="https://github.com/chrissoria/cat-llm">cat-llm</a></strong>, adapted for social media data.</p>

<hr />

<h2 id="background-cat-llm-and-cat-vader">Background: cat-llm and cat-vader</h2>

<p><a href="https://github.com/chrissoria/cat-llm">cat-llm</a> is an open-source Python package I originally built for classifying open-ended survey responses at scale. You give it a list of text responses and a set of categories, and it uses large language models to assign each response to one or more categories, with support for multi-model ensembles, chain-of-thought reasoning, and automatic category discovery. It was designed for researchers who need to code thousands of survey responses without manually reading each one.</p>

<p>The core architecture turned out to be highly resilient to different kinds of text input. Survey responses and social media posts are structurally similar — short, opinionated, often ambiguous text that needs to be bucketed into meaningful categories. So I cloned cat-llm into <strong><a href="https://github.com/chrissoria/cat-vader">cat-vader</a></strong>, stripped out the survey-specific scaffolding, and built a pipeline that can classify any collection of social media posts — whether you’re working from a scraped dataset, a platform export, or a direct API pull. For convenience, cat-vader also wires directly to the Threads API to pull your personal post history with engagement metrics in one call.</p>

<p>The goal of this post is to walk through that pipeline end-to-end using my own Threads feed as the example dataset, and then use the results to take an honest look at what I’ve been posting about. The same workflow applies to any corpus of social media text.</p>

<hr />

<h2 id="what-do-i-actually-post-about">What Do I Actually Post About?</h2>

<p>Before classifying anything, I needed to decide what categories to use. I could have imposed them from the top down — just picked eight topics that felt right — but that risks missing something real in my data, or imposing categories that don’t actually fit how I write. Instead, I used cat-vader’s <code class="language-plaintext highlighter-rouge">explore()</code> function to let the data suggest its own themes first.</p>

<p><code class="language-plaintext highlighter-rouge">explore()</code> works by repeatedly sampling random chunks of posts, asking the LLM to extract the most common topics from each chunk, and collecting all the extracted labels across many passes. It doesn’t merge or deduplicate — it returns every raw label string from every chunk across every iteration. The idea is that categories which appear frequently and consistently are the ones that genuinely characterize the corpus, while one-off labels are noise.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catvader</span> <span class="k">as</span> <span class="n">cv</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>

<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"threads_year.csv"</span><span class="p">)</span>
<span class="n">texts</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s">"text"</span><span class="p">].</span><span class="nb">str</span><span class="p">.</span><span class="nb">len</span><span class="p">()</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">][</span><span class="s">"text"</span><span class="p">].</span><span class="n">tolist</span><span class="p">()</span>  <span class="c1"># 582 posts with text
</span>
<span class="n">raw</span> <span class="o">=</span> <span class="n">cv</span><span class="p">.</span><span class="n">explore</span><span class="p">(</span>
    <span class="n">input_data</span><span class="o">=</span><span class="n">texts</span><span class="p">,</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-openai-api-key"</span><span class="p">,</span>
    <span class="n">description</span><span class="o">=</span><span class="s">"Social media posts about current events, politics, technology, culture, and personal opinions"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"gpt-4o"</span><span class="p">,</span>
    <span class="n">iterations</span><span class="o">=</span><span class="mi">6</span><span class="p">,</span>
    <span class="n">divisions</span><span class="o">=</span><span class="mi">15</span><span class="p">,</span>
    <span class="n">max_categories</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
    <span class="n">categories_per_chunk</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>This produced 720 raw category extractions across 6 iterations and 15 chunk divisions, yielding 229 unique label strings. I counted the frequency of each label and eyeballed the top results to identify which themes were genuinely dominant versus which were just slightly different phrasings of the same idea (e.g. “Economy”, “Economics”, “Economy and Business”, and “Economy and Finance” are all the same theme).</p>
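
<p>The counting step is a one-liner once you have the raw labels. A minimal sketch, assuming <code class="language-plaintext highlighter-rouge">explore()</code> hands back the flat list of label strings described above:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>from collections import Counter

# Tally every raw label returned by explore() and print the most common ones
label_counts = Counter(raw)
print(f"{len(raw)} extractions, {len(label_counts)} unique labels")
for label, n in label_counts.most_common(20):
    print(f"{n:3d}  {label}")
</code></pre></div></div>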

<p>Here are the top categories by raw frequency:</p>

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Times Found</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Politics</td>
      <td>58</td>
    </tr>
    <tr>
      <td>Technology</td>
      <td>52</td>
    </tr>
    <tr>
      <td>Social Issues</td>
      <td>39</td>
    </tr>
    <tr>
      <td>Personal Opinions</td>
      <td>27</td>
    </tr>
    <tr>
      <td>Technology and AI</td>
      <td>22</td>
    </tr>
    <tr>
      <td>Economics</td>
      <td>16</td>
    </tr>
    <tr>
      <td>Education</td>
      <td>16</td>
    </tr>
    <tr>
      <td>Health and Science</td>
      <td>16</td>
    </tr>
    <tr>
      <td>Culture</td>
      <td>16</td>
    </tr>
    <tr>
      <td>Economy</td>
      <td>13</td>
    </tr>
    <tr>
      <td>Personal Experiences</td>
      <td>12</td>
    </tr>
    <tr>
      <td>Culture and Society</td>
      <td>12</td>
    </tr>
    <tr>
      <td>Media and Communication</td>
      <td>10</td>
    </tr>
    <tr>
      <td>Personal Opinions and Experiences</td>
      <td>10</td>
    </tr>
    <tr>
      <td>Media and Entertainment</td>
      <td>9</td>
    </tr>
    <tr>
      <td>Economy and Business</td>
      <td>9</td>
    </tr>
    <tr>
      <td>Education and Academia</td>
      <td>9</td>
    </tr>
    <tr>
      <td>Science and Health</td>
      <td>9</td>
    </tr>
  </tbody>
</table>

<p>The signal is clear. Collapsing the variants down, eight themes dominate: <strong>Politics</strong>, <strong>Technology &amp; AI</strong>, <strong>Social Issues</strong>, <strong>Economics &amp; Finance</strong>, <strong>Health &amp; Science</strong>, <strong>Education &amp; Research</strong>, <strong>Culture &amp; Entertainment</strong>, and <strong>Personal</strong>. These became the starting point for the final category set.</p>

<h3 id="defining-my-categories">Defining My Categories</h3>

<p>Rather than using bare labels, I defined each category with a description and concrete examples. This follows best-practice category construction from my own empirical work on LLM classification: verbose categories with descriptions and examples significantly outperform bare labels, improving accuracy by reducing model ambiguity on borderline cases.</p>

<p>The <code class="language-plaintext highlighter-rouge">explore()</code> output pointed to the broad themes, but the final set draws heavily on my own domain knowledge of what I post about. I know I post a lot about politics, and I know that my political posts tend to fall into distinct registers — partisan frustration, specific policy arguments, and direct Trump commentary — that a generic “Politics” label would collapse together. I also know I post disproportionately about AI relative to most people, which warranted its own category rather than being folded into Technology. My final categories reflect both what my data showed and what I know about myself as a poster.</p>

<p><strong>1. Partisan Politics</strong> — Posts relating to partisanship directly or indirectly: references to political parties, political tribalism, electoral dynamics, or the behavior of politicians and political actors as representatives of a party or ideological bloc (e.g., “The Republican Party has moved too far right,” “Democrats keep losing working-class voters”).</p>

<blockquote>
  <p><em>“Either side of the political spectrum has little empathy for the other. They actively dislike each other. When an act of violence occurs, the first instinct is to ask which side did it.”</em></p>
</blockquote>

<p><strong>2. Policy Politics</strong> — Posts advocating for or critiquing specific policies or policy positions, independent of partisan framing: arguments about what government should or shouldn’t do, regulatory stances, or calls for systemic reform (e.g., “We need universal healthcare,” “Tech companies need stronger antitrust enforcement”).</p>

<blockquote>
  <p><em>“It should be illegal to create AI videos meant to mislead and misinform people about current events.”</em></p>
</blockquote>

<p><strong>3. Anti-Trump</strong> — Posts directly critiquing Donald Trump, his character, his decisions in office, his policies, or individuals and groups who support him or his agenda (e.g., “Trump’s tariffs are going to tank the economy,” “MAGA voters keep getting lied to”).</p>

<blockquote>
  <p><em>“Trump is doing a great job at driving global unity… in their opposition to the US as an international bully.”</em></p>
</blockquote>

<p><strong>4. Technology</strong> — Posts discussing technology in any form, broadly construed: software, hardware, consumer devices, platforms, the tech industry, or the societal implications of technological change (e.g., “Apple’s new chip is a generational leap,” “Social media is rewiring how we form opinions”).</p>

<blockquote>
  <p><em>“Anthropic is all the hype but OpenAI still has the best models sorry to say.”</em></p>
</blockquote>

<p><strong>5. Artificial Intelligence</strong> — Posts specifically about AI models, AI capabilities, or commentary on AI companies such as OpenAI, Anthropic, Google DeepMind, or xAI. This includes takes on specific models (GPT, Claude, Gemini, Grok), opinions on what AI can and cannot do, or the direction of the AI industry (e.g., “The AI hype cycle is showing cracks,” “LLMs are great at pattern matching but terrible at actual reasoning”).</p>

<blockquote>
  <p><em>“In my opinion, the only people who are saying LLMs will someday automate all jobs don’t really understand the technology.”</em></p>
</blockquote>

<p><strong>6. Social Issues</strong> — Posts about social conditions, inequality, discrimination, or systemic patterns in society, without explicitly advocating for a specific policy response or criticizing political leadership. The focus is observational or normative about society itself rather than prescriptive about what government should do (e.g., “The wealth gap between generations is unlike anything we’ve seen,” “Racism in hiring is still very much alive”).</p>

<blockquote>
  <p><em>“The most deflating thing about this whole thing is how two people will view the same video and come to entirely different conclusions.”</em></p>
</blockquote>

<p><strong>7. Shit Posting</strong> — Low-effort, irreverent, or deliberately provocative posts with no pretense of serious commentary. The tone is casual to the point of flippant, the take is blunt, and the goal is more to express a vibe than make an argument (e.g., “Astrology is BS,” “Nobody actually likes networking events”).</p>

<blockquote>
  <p><em>“Daily reminder that astrology is still BS.”</em></p>
</blockquote>

<p><strong>8. Economics &amp; Finance</strong> — Posts relating to economic conditions, financial markets, or specific market developments: references to stock prices, commodity prices, oil markets, interest rates, inflation, or broader signals about the state of the economy (e.g., “The stock market is pricing in a recession,” “Coffee prices are up 40% and nobody is talking about it”).</p>

<blockquote>
  <p><em>“The Strait of Hormuz, which handles roughly 20% of the world’s daily oil supply, is effectively shut down. That means lower supply, which means higher prices. When oil prices rise the price of all other commodities rise.”</em></p>
</blockquote>

<p><strong>9. Thirst Trap</strong> — Posts that are flirty, self-promotional, or designed to attract attention and engagement through charm or physical appeal (e.g., “Just got a haircut and feeling myself,” “Anyone else look good today or just me?”).</p>

<h3 id="my-category-breakdown">My Category Breakdown</h3>

<p>With my categories defined, I ran <code class="language-plaintext highlighter-rouge">classify()</code> on the full year of text posts — 582 posts with non-empty text content — using Llama 3.3 70B on SambaNova. Each post was classified against all categories independently, meaning a single post can and often does belong to more than one category. A post lamenting Trump’s tariff policy, for example, might be tagged as both Anti-Trump and Economics &amp; Finance. That’s by design: my categories aren’t mutually exclusive buckets, they’re lenses.</p>
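
<p>For reference, the shape of that call looks roughly like this. The category strings are abbreviated here (the real ones carry the full descriptions and examples listed above), and the model identifier and argument names are assumptions that mirror the <code class="language-plaintext highlighter-rouge">explore()</code> call earlier rather than a documented signature:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import catvader as cv

# Sketch: classify the 582 text posts against the nine verbose categories.
# Category strings abbreviated; the model name is an assumed SambaNova identifier.
categories = [
    "Partisan Politics: posts about parties, tribalism, or electoral dynamics...",
    "Policy Politics: advocacy for or critique of specific policies...",
    "Anti-Trump: direct critique of Trump, his policies, or his supporters...",
    # ...the remaining six categories, each with a description and examples
]

classified = cv.classify(
    input_data=texts,                           # the 582 text posts from earlier
    categories=categories,
    user_model="Meta-Llama-3.3-70B-Instruct",   # assumed model identifier
    api_key="your-sambanova-api-key",
)
</code></pre></div></div>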

<p>The chart below shows the percentage of posts that were assigned to each category. Because categories overlap, the bars don’t sum to 100% — they can’t. What the chart is really showing is the <em>frequency</em> of each topic in my feed: how often, across 582 posts, did I reach for a given subject. Think of it less as a pie chart and more as a set of independent thermometers, each measuring how much of my posting energy went toward a given theme.</p>

<p><img src="/images/catvader-category-distribution.png" alt="" /></p>

<p>A note on scope: cat-vader can classify images directly, but for this analysis I focused on text only. My dataset includes image posts, but the model was given just the text caption — no image content. Posts without any text were excluded entirely, and image posts were classified solely on whatever caption was attached. That’s a real limitation, and one worth keeping in mind when interpreting any categories that might skew visual (more on that in a moment).</p>

<p>I’ll be honest: I wasn’t sure what I’d find. I post somewhat mindlessly — something catches my eye, I have a reaction, I type it out. I don’t sit down with a content strategy. So this is genuinely an exercise in holding up a mirror.</p>

<p>One early finding did give me pause: <strong>Shit Posting</strong> came in second overall at 27.7% of posts, just behind <strong>Social Issues</strong> at 32.5% and ahead of <strong>Technology</strong> at 25.9%. My first instinct was that something had gone wrong — a miscategorized label, a prompt that was too loose, something. I went back and spot-checked the flagged posts. Nope. Fully accurate. Apparently more than a quarter of what I put out into the world is, by any reasonable definition, a shit post. I have made peace with this, though I’ve also quietly vowed to post with a bit more intention going forward, with the goal of demoting Shit Posting from a top-three category to something more like fifth. For what it’s worth, my top four — Social Issues (32.5%), Shit Posting (27.7%), Technology (25.9%), and Partisan Politics (24.6%) — probably tell you everything you need to know about me as a person.</p>

<p>One other result worth flagging: <strong>Thirst Trap</strong> came in at exactly one post (0.2%). False positives happen, and this is a good example of why. The post in question was an image captioned simply <em>“Me”</em>, and since the model only had that single word to work with, tagging it as a thirst trap is a defensible inference. Whether it actually was one depends on the photo, which the model never saw. I’m not saying it wasn’t.</p>

<hr />

<h2 id="engagement-by-category">Engagement by Category</h2>

<p>One of the advantages of pulling data directly through the Threads API is that cat-vader returns not just post text but a full set of engagement metrics alongside it. For every post, the package outputs:</p>

<table>
  <thead>
    <tr>
      <th>Column</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">timestamp</code></td>
      <td>Date and time of the post</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">media_type</code></td>
      <td>Post type (text, image, video, repost)</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">text</code></td>
      <td>Post text content</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">image_url</code></td>
      <td>URL of attached image, if any</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">likes</code></td>
      <td>Number of likes</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">replies</code></td>
      <td>Number of replies/comments</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">reposts</code></td>
      <td>Number of reposts</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">quotes</code></td>
      <td>Number of quote posts</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">views</code></td>
      <td>Total post impressions</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">shares</code></td>
      <td>Number of shares</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">clicks</code></td>
      <td>Number of link clicks</td>
    </tr>
  </tbody>
</table>

<p>That means the classified dataset isn’t just a topic-coded text corpus — it’s a topic-coded text corpus with performance data attached. Which opens up an obvious question: does what I post about actually affect how much engagement I get?</p>

<p>The chart below shows average likes and average replies broken out by category. Because posts can belong to multiple categories, the same post may appear in more than one bar. The x-axis is sorted by average likes descending.</p>

<p><img src="/images/catvader-engagement-by-category.png" alt="" /></p>

<p>To see the full picture, the scatter plot below shows every post individually — x-axis is replies, y-axis is likes, log scale on both axes, colored by primary category. The log scale is doing a lot of work here. Without it, the chart is basically one dot in the top-right corner and 580 dots stacked on top of each other at zero. With it, you can actually see my distribution — which, in all honesty, is mostly a dense cloud of dots in the bottom-left. The majority of what I post sinks quietly into the void, liked by a handful of people who were probably just scrolling past and hit the button by accident. A few posts escape. Most do not. This is the reality of posting.</p>

<p><img src="/images/catvader-engagement-scatter.png" alt="" /></p>

<hr />

<h2 id="what-actually-predicts-views">What Actually Predicts Views?</h2>

<p>The bar charts above are descriptive — they show averages, but averages don’t control for anything. A category might look high-performing simply because I happen to post it more, or because it correlates with another category that’s doing the real work. To get a cleaner picture, I ran regression models that include all of the category indicators simultaneously, isolating the independent effect of each one. I ran a separate model for each outcome — views, likes, and replies.</p>
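
<p>Concretely, one way to set these models up is an OLS regression of a log-transformed outcome on the binary category columns, which is consistent with reporting effects as multiples. A sketch, with placeholder file and column names:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("threads_classified.csv")  # hypothetical path
category_cols = ["partisan_politics", "policy_politics", "anti_trump",
                 "technology", "artificial_intelligence", "social_issues",
                 "shit_posting", "economics_finance", "thirst_trap"]

# One model per outcome; shown here for views
X = sm.add_constant(df[category_cols].astype(float))
y = np.log1p(df["views"])

views_model = sm.OLS(y, X).fit()
print(views_model.summary())
# np.exp(coef) approximates the multiplicative effect of each category on views
</code></pre></div></div>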

<p>The chart below shows the coefficients from the views model, with 95% confidence intervals. Points in red are statistically significant (p &lt; 0.05); grey points are not.</p>

<p><img src="/images/catvader-regression-views.png" alt="" /></p>

<p>Three categories emerge as significant positive predictors of views <em>(for the stats nerds: R² = 6.8%, F p = 2.5×10⁻⁶)</em>:</p>

<ul>
  <li>
    <p><strong>Artificial Intelligence</strong> is the single strongest predictor. Holding everything else constant, an AI post gets roughly <strong>2.4x as many views</strong> as a comparable post on a different topic. AI has been the trendiest topic on the internet for the past two years, and the algorithm appears to reward it accordingly.</p>
  </li>
  <li>
    <p><strong>Partisan Politics</strong> comes in second. A partisan political post gets about <strong>2x as many views</strong> as a comparable non-partisan post. Tribal political content travels well on social media — this surprises no one.</p>
  </li>
  <li>
    <p><strong>Shit Posting</strong> rounds out the significant predictors, with posts in this category getting about <strong>1.8x as many views</strong>. Blunt, low-effort, instantly legible takes are apparently what the algorithm rewards. I have complicated feelings about this.</p>
  </li>
</ul>

<p>For likes, the story is similar but smaller in scale: partisan politics posts get about <strong>50% more likes</strong> than comparable posts, and shit posts about <strong>35% more</strong>. AI drops out entirely for likes — AI posts rack up views without proportionally converting to likes. Interesting.</p>

<p>For replies, no category made a meaningful difference. Nothing I post about reliably generates conversation, which is either a sign of epistemic humility or evidence that I am not as interesting as I think I am.</p>

<p>Topic explains only a small slice of overall engagement — most of what determines whether a post goes anywhere is timing, luck, and whether someone with a large following happens to engage. But the categories that do matter are consistent and interpretable.</p>

<p>One natural question is whether these effects are real or just a reflection of timing — maybe I happen to post AI content on Thursdays at peak hours, and it’s the timing doing the work rather than the topic. To check, I re-ran all models controlling for both day of week and hour of day.</p>

<p>The results held up for the most part. Partisan politics and AI posts still get roughly <strong>2x as many views</strong> after accounting for when they were posted — those effects appear to be about the content itself. Shit posting weakens once timing is controlled for, suggesting some of its raw advantage was coming from <em>when</em> I tend to fire off a shit post rather than the content. The reply non-result holds throughout.</p>

<p><img src="/images/catvader-regression-views-adj.png" alt="" /></p>

<p>One more alternative explanation worth ruling out: volume. Maybe on high-output days I’m simply flooding my feed and one post happens to catch a wave — meaning it’s the quantity, not the quality of the content, doing the work. The chart below plots each post’s views and likes against the number of posts I made that day.</p>

<p><img src="/images/catvader-freq-vs-engagement.png" alt="" /></p>

<p>The correlations are r = 0.16 for views and r = 0.17 for likes — small but consistently positive, suggesting a weak connection. Posting more on a given day does seem to nudge individual post performance slightly, though the effect is modest enough that it’s unlikely to be the main story. The content effects from the models hold.</p>
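
<p>The volume check itself is short. A sketch, assuming the timestamp, views, and likes columns from the cat-vader pull:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Correlate each post's engagement with how many posts went out the same day
df["date"] = pd.to_datetime(df["timestamp"]).dt.date
df["posts_that_day"] = df.groupby("date")["date"].transform("size")

print(df["posts_that_day"].corr(df["views"]))   # r = 0.16 in my data
print(df["posts_that_day"].corr(df["likes"]))   # r = 0.17 in my data
</code></pre></div></div>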

<hr />

<h2 id="bonus-does-the-day-of-the-week-matter">Bonus: Does the Day of the Week Matter?</h2>

<p>One last question that the timestamp data makes easy to answer: does <em>when</em> I post matter?</p>

<p><img src="/images/catvader-views-by-weekday.png" alt="" /></p>

<p>Thursday stands out immediately — average views of 3,241, more than double the next best day (Saturday at 1,439) and roughly eight times Wednesday’s average of 398. Monday and Friday are middling. Wednesday is the worst day to post, by a wide margin.</p>

<p>I don’t have a strong theory for why Thursday in particular. It might be something about my posting behavior on Thursdays — maybe I tend to post more shareable content, or post at better times within the day. It might also just be noise: with 72–106 posts per weekday, a handful of viral Thursday posts could skew the average significantly. Either way, the answer to “when should I post?” appears to be: not Wednesday.</p>
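
<p>Reproducing the weekday breakdown takes a few lines once the posts are in a DataFrame. A sketch, again assuming the column names from the cat-vader pull:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import pandas as pd

# Average views and post count by day of week
df["weekday"] = pd.to_datetime(df["timestamp"]).dt.day_name()
weekday_views = (
    df.groupby("weekday")["views"]
      .agg(["mean", "count"])
      .sort_values("mean", ascending=False)
)
print(weekday_views)
</code></pre></div></div>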

<p>Time of day tells a cleaner story. The chart below breaks posts into five windows (Pacific Time):</p>

<p><img src="/images/catvader-views-by-timeofday.png" alt="" /></p>

<p>Late night posts (9pm–5am) average nearly 2,900 views — more than double the evening average of 1,126, and about twelve times the morning average of ~250. The pattern is monotonic: the later in the day, the more views. One plausible explanation is that late night is when I tend to post more impulsively about whatever’s dominating the news cycle, which also happens to be when more people are doom-scrolling. Another is that late night posts have more hours to accumulate views before I wake up and post something else that pushes them down. Either way, if I want views, apparently I should stay up later — which is not advice I needed the algorithm to give me.</p>

<hr />

<h2 id="conclusion-do-this-with-your-own-data">Conclusion: Do This With Your Own Data</h2>

<p>Everything in this post — pulling my data, discovering my categories, classifying 582 posts, and running the regressions — took a single afternoon. The data pull and classification itself ran in about 30 minutes; the rest was just analysis and writing. If you have a Threads account and a few API keys, you can run the same analysis on your own feed.</p>

<p>The most valuable step is <code class="language-plaintext highlighter-rouge">explore()</code> first. Don’t impose your categories from the top down. Run the exploration pass, look at what themes emerge with high frequency, and let your actual content tell you what it’s about. Your categories will be better for it, and you’ll probably learn something about yourself in the process that you wouldn’t have guessed going in.</p>

<p>From there, <code class="language-plaintext highlighter-rouge">classify()</code> gives you a labelled dataset you can take in any direction. A few starting points:</p>

<ul>
  <li><strong>Sentiment and tone.</strong> Instead of topic categories, define categories like “optimistic,” “cynical,” “ironic,” or “earnest.” You’ll get a mood profile of your posting history.</li>
  <li><strong>Audience targeting.</strong> If you post for multiple audiences — say, researchers, practitioners, and general readers — define categories for each and see how your mix has shifted over time.</li>
  <li><strong>Thread evolution.</strong> Classify posts by month and track how your topical distribution has changed. Are you posting more or less about AI than you were a year ago? The data will tell you.</li>
  <li><strong>Quote extraction.</strong> Use <code class="language-plaintext highlighter-rouge">extract()</code> instead of <code class="language-plaintext highlighter-rouge">classify()</code> to pull structured fields out of free text — named entities, specific claims, URLs, anything you want to turn into a column.</li>
</ul>

<p>Beyond social media, the underlying engine is <strong><a href="https://pypi.org/project/cat-llm/">cat-llm</a></strong>, which was designed for survey and qualitative data. If you’re a researcher sitting on thousands of open-ended survey responses, interview transcripts, or product reviews, the same pipeline applies. Define your codebook as a set of verbose category descriptions, run <code class="language-plaintext highlighter-rouge">classify()</code>, and get back a coded dataset in minutes rather than weeks. The package supports multi-model ensembles, chain-of-thought reasoning, and automatic inter-rater reliability metrics: all the things you’d want for academic coding workflows.</p>

<p>If you want to adapt it for your own platform or use case, <a href="https://github.com/chrissoria/cat-llm">cat-llm</a> is open source and built to be forked. cat-vader is one fork; there’s no reason there couldn’t be a cat-reddit, a cat-bluesky, or a cat-transcripts for interview data. The core classification and exploration logic is platform-agnostic. All you need to wire up is a data ingestion layer for whatever source you’re working with.</p>

<p>If you build something interesting with it, I’d genuinely like to hear about it. Reach out at <a href="mailto:chrissoria@berkeley.edu">chrissoria@berkeley.edu</a>.</p>

<p>One last thing: I’ll be re-running this analysis in six months to see whether I’ve made good on my vow to reduce my shit posting. The pipeline takes an afternoon. The habit change may take longer.</p>

<hr />

<h2 id="how-i-did-it-and-how-you-can-too">How I Did It, and How You Can Too?</h2>

<p>Want to run this on your own data? Here’s the technical setup.</p>

<h3 id="getting-started">Getting Started</h3>

<p><a href="https://pypi.org/project/cat-vader/">cat-vader</a> is available on PyPI:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>cat-vader
</code></pre></div></div>

<p>You’ll also need a Threads access token. Generate one via the <a href="https://developers.facebook.com/">Meta for Developers</a> portal (create an app, add the Threads product, and generate a long-lived user token), then add it to a <code class="language-plaintext highlighter-rouge">.env</code> file:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">THREADS_ACCESS_TOKEN</span><span class="o">=</span><span class="s2">"your-token-here"</span>
<span class="nv">THREADS_USER_ID</span><span class="o">=</span><span class="s2">"your-numeric-user-id"</span>
</code></pre></div></div>

<p>cat-vader will pick these up automatically when you call any function with <code class="language-plaintext highlighter-rouge">sm_source="threads"</code>. Alternatively, you can pass your API key directly as a parameter in any function call — the <code class="language-plaintext highlighter-rouge">.env</code> file is just a convenience for avoiding repetition.</p>

<p>If you already have social media data — a CSV of posts from any platform, a scraped dataset, a platform export — you can skip the API setup entirely and pass your text directly to <code class="language-plaintext highlighter-rouge">classify()</code>, <code class="language-plaintext highlighter-rouge">explore()</code>, or <code class="language-plaintext highlighter-rouge">extract()</code> via the <code class="language-plaintext highlighter-rouge">input_data</code> parameter. The <code class="language-plaintext highlighter-rouge">sm_source</code> integration is a convenience layer on top of the same classification engine.</p>

<h3 id="pulling-my-threads-history">Pulling My Threads History</h3>

<p>For the Threads API pull: cat-vader connects to your account and retrieves your full post history — every post you’ve made, along with engagement metrics — automatically, without any manual data export. You authenticate once via the Threads Graph API, store your credentials in a <code class="language-plaintext highlighter-rouge">.env</code> file, and cat-vader handles the rest.</p>

<p>Under the hood, the package paginates through your full post history (the API returns up to 100 posts per page), fetches engagement metrics for each post in a separate insights call, and returns everything as a single tidy DataFrame with one row per post. Columns include the post text, image URL (when an image was attached), media type, and metrics: likes, views, replies, reposts, quotes, and shares.</p>

<p>The key parameter is <code class="language-plaintext highlighter-rouge">sm_source="threads"</code>, which can be passed to any of the main functions — <code class="language-plaintext highlighter-rouge">classify()</code>, <code class="language-plaintext highlighter-rouge">extract()</code>, or <code class="language-plaintext highlighter-rouge">explore()</code>. You can scope the pull to a specific time window using <code class="language-plaintext highlighter-rouge">sm_months</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catvader</span> <span class="k">as</span> <span class="n">cv</span>

<span class="c1"># Pull and classify the last 12 months of your personal Threads history
</span><span class="n">results</span> <span class="o">=</span> <span class="n">cv</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">sm_source</span><span class="o">=</span><span class="s">"threads"</span><span class="p">,</span>   <span class="c1"># connect to your Threads account
</span>    <span class="n">sm_months</span><span class="o">=</span><span class="mi">12</span><span class="p">,</span>          <span class="c1"># fetch all posts from the past year
</span>    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Politics"</span><span class="p">,</span> <span class="s">"Technology &amp; AI"</span><span class="p">,</span> <span class="s">"Economics"</span><span class="p">,</span> <span class="s">"Health &amp; Science"</span><span class="p">,</span>
                <span class="s">"Education &amp; Research"</span><span class="p">,</span> <span class="s">"Culture &amp; Entertainment"</span><span class="p">,</span>
                <span class="s">"Social Issues"</span><span class="p">,</span> <span class="s">"Personal"</span><span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-openai-api-key"</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here are my five most-liked posts from the dataset:</p>

<table>
  <thead>
    <tr>
      <th>date</th>
      <th>text</th>
      <th>likes</th>
      <th>views</th>
      <th>replies</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>2025-11-14</td>
      <td>Credit to Representative Robert Garcia for releasing the initial three emails from the Epstein emails. Dude is risking a lot.</td>
      <td>5,146</td>
      <td>22,504</td>
      <td>61</td>
    </tr>
    <tr>
      <td>2025-02-17</td>
      <td>Testing my hypothesis that the algorithm will boost the word Costco. If this gets more than my usual 0 likes then the null is rejected. Costco Costco Costco…</td>
      <td>2,637</td>
      <td>35,387</td>
      <td>36</td>
    </tr>
    <tr>
      <td>2025-01-12</td>
      <td>Alex Jones was posting on X that L.A. firefighters were battling the blazes using ladies’ handbags as buckets because officials had donated equipment to Ukraine…</td>
      <td>1,123</td>
      <td>8,235</td>
      <td>127</td>
    </tr>
    <tr>
      <td>2025-09-24</td>
      <td>Did you all notice how your collective action just “overpowered” Trump and got Jimmy Kimmel back on the air?</td>
      <td>619</td>
      <td>8,517</td>
      <td>67</td>
    </tr>
    <tr>
      <td>2025-01-24</td>
      <td>The United State of California has a good ring to it.</td>
      <td>491</td>
      <td>7,305</td>
      <td>50</td>
    </tr>
  </tbody>
</table>

<p>For my account, pulling my full history returned <strong>850 posts</strong> going back to July 2023, about two and a half years. Of those, 176 were image posts, 5 were videos, and 582 had text content; the remainder were reposts or media-only posts.</p>

<p>One note on the metrics: the Threads Insights API takes a few hours to populate data for brand new posts, so very recent posts may show zeros. Older posts return accurate lifetime totals.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="LLM" /><category term="social media" /><category term="cat-llm" /><category term="cat-vader" /><category term="NLP" /><category term="open source" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Ensemble Classification in CatLLM: Combining Multiple Models for Robust Results</title><link href="https://christophersoria.com/posts/2026/01/catllm-ensemble-classification/" rel="alternate" type="text/html" title="Ensemble Classification in CatLLM: Combining Multiple Models for Robust Results" /><published>2026-01-17T00:00:00-08:00</published><updated>2026-01-17T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/01/catllm-ensemble-classification</id><content type="html" xml:base="https://christophersoria.com/posts/2026/01/catllm-ensemble-classification/"><![CDATA[<p><img src="/images/catllm_ensemble.png" alt="CatLLM" /></p>

<p><a href="https://pypi.org/project/cat-llm/">CatLLM</a> now supports ensemble classification—running multiple models in parallel and combining their predictions through voting. This addresses a persistent concern in LLM-based classification: how do you know if a single model’s output is reliable?</p>

<h2 id="the-problem-with-single-model-classification">The Problem with Single-Model Classification</h2>

<p>When you classify survey responses with a single LLM, you’re trusting that model’s interpretation entirely. But different models have different training data, different biases, and different failure modes. A response that GPT-4o categorizes as “positive sentiment” might be labeled “neutral” by Claude, and “mixed” by Gemini. Which one is right?</p>

<p>For research applications where classification decisions feed into statistical analyses, this uncertainty matters. Ensemble methods offer a way to quantify and reduce it.</p>

<h2 id="three-approaches-to-ensemble-classification">Three Approaches to Ensemble Classification</h2>

<h3 id="1-cross-provider-ensembles">1. Cross-Provider Ensembles</h3>

<p>You can combine models from different providers—OpenAI, Anthropic, Google, Mistral, and others—to get diverse perspectives on each classification:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">input_data</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'responses'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Positive"</span><span class="p">,</span> <span class="s">"Negative"</span><span class="p">,</span> <span class="s">"Neutral"</span><span class="p">],</span>
    <span class="n">models</span><span class="o">=</span><span class="p">[</span>
        <span class="p">(</span><span class="s">"gpt-4o"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">,</span> <span class="s">"sk-..."</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"claude-sonnet-4-5-20250929"</span><span class="p">,</span> <span class="s">"anthropic"</span><span class="p">,</span> <span class="s">"sk-ant-..."</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"gemini-2.5-flash"</span><span class="p">,</span> <span class="s">"google"</span><span class="p">,</span> <span class="s">"AIza..."</span><span class="p">),</span>
    <span class="p">],</span>
    <span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"majority"</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Why this helps:</strong> Each provider’s models are trained on different data with different objectives. When three independently-developed models agree on a classification, that agreement carries more weight than any single model’s confidence score. When they disagree, you’ve identified responses that may require human review.</p>

<h3 id="2-self-consistency-with-temperature">2. Self-Consistency with Temperature</h3>

<p>You can also ensemble the same model against itself by running it multiple times with higher temperature (randomness). This samples from the model’s probability distribution rather than always taking the most likely output:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">classify</span><span class="p">(</span>
    <span class="n">input_data</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'responses'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Category A"</span><span class="p">,</span> <span class="s">"Category B"</span><span class="p">,</span> <span class="s">"Category C"</span><span class="p">],</span>
    <span class="n">models</span><span class="o">=</span><span class="p">[</span>
        <span class="p">(</span><span class="s">"gpt-4o"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">,</span> <span class="s">"sk-..."</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"gpt-4o"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">,</span> <span class="s">"sk-..."</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"gpt-4o"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">,</span> <span class="s">"sk-..."</span><span class="p">),</span>
    <span class="p">],</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>  <span class="c1"># Higher temperature for varied outputs
</span>    <span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"majority"</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Why this helps:</strong> At temperature 0, a model always produces the same output for the same input. At higher temperatures, it samples from its full distribution of possible responses. If a classification is robust, the model should arrive at the same answer even when sampling differently. If it produces different answers across runs, that response is likely ambiguous or borderline.</p>

<p>This approach is cheaper than cross-provider ensembles (one API key, often lower per-token costs) while still providing a measure of classification stability.</p>

<h3 id="3-consensus-thresholds">3. Consensus Thresholds</h3>

<p>CatLLM provides three voting rules for determining consensus:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Majority: At least 50% of models must agree
</span><span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"majority"</span>

<span class="c1"># Two-thirds: At least 67% of models must agree
</span><span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"two-thirds"</span>

<span class="c1"># Unanimous: All models must agree
</span><span class="n">consensus_threshold</span><span class="o">=</span><span class="s">"unanimous"</span>
</code></pre></div></div>

<p>You can also specify custom numeric thresholds (e.g., <code class="language-plaintext highlighter-rouge">consensus_threshold=0.75</code> for 75% agreement).</p>

<p><strong>Majority</strong> is the least restrictive. With three models, two agreeing is sufficient. This maximizes the number of responses that receive a consensus classification.</p>

<p><strong>Two-thirds</strong> requires stronger agreement. With three models, you still need two to agree (67%), but with six models, you’d need four. This reduces false positives at the cost of more responses falling below threshold.</p>

<p><strong>Unanimous</strong> is the most restrictive. Every model must agree for a category to be marked present. This produces high-confidence classifications but may leave many responses without consensus, flagging them for human review.</p>
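
<p>To make the three rules concrete, here’s a small standalone sketch (plain Python, not CatLLM internals) showing how the same set of votes resolves under each threshold; the <code class="language-plaintext highlighter-rouge">consensus</code> helper and the example votes are purely illustrative.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Standalone illustration of the voting rules (not CatLLM internals).
from collections import Counter

def consensus(votes, threshold):
    """Return the winning label if its vote share meets the threshold, else None."""
    label, count = Counter(votes).most_common(1)[0]
    share = count / len(votes)
    cutoff = {"majority": 0.5, "two-thirds": 2 / 3, "unanimous": 1.0}.get(threshold, threshold)
    return label if share >= cutoff else None

votes = [1, 1, 1, 0, 0]  # five models: three mark the category present, two absent
print(consensus(votes, "majority"))    # 1    (3/5 = 60%, meets the 50% cutoff)
print(consensus(votes, "two-thirds"))  # None (60% falls short of 67%), flag for review
print(consensus(votes, 0.75))          # None (a custom 75% threshold is also not met)
</code></pre></div></div>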

<h2 id="interpreting-the-output">Interpreting the Output</h2>

<p>The results DataFrame includes columns for each model’s individual classification plus the consensus:</p>

<table>
  <thead>
    <tr>
      <th>response</th>
      <th>category_1_gpt_4o</th>
      <th>category_1_claude</th>
      <th>category_1_gemini</th>
      <th>category_1_consensus</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>“Great service”</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
    </tr>
    <tr>
      <td>“It was okay”</td>
      <td>0</td>
      <td>1</td>
      <td>0</td>
      <td>0</td>
    </tr>
    <tr>
      <td>“Loved it”</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
      <td>1</td>
    </tr>
  </tbody>
</table>

<p>For the second response, GPT-4o and Gemini said “not positive” while Claude said “positive.” With majority voting, the consensus is 0 (not positive): two of the three models agree that the category is absent, and the lone “positive” vote falls short of a majority.</p>

<p>You can use the agreement patterns to:</p>
<ul>
  <li>Identify systematic differences between models</li>
  <li>Flag ambiguous responses for manual review (see the sketch after this list)</li>
  <li>Report inter-model reliability alongside your results</li>
</ul>
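
<p>Here’s a minimal post-processing sketch along those lines. It assumes <code class="language-plaintext highlighter-rouge">results</code> is the DataFrame returned by <code class="language-plaintext highlighter-rouge">cat.classify()</code> and that the per-model column names match the example table above; adjust the names to whatever your actual output contains.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: agreement rates and review flags from the ensemble output above.
# Column names follow the example table; adjust them to match your output.
from itertools import combinations

model_cols = ["category_1_gpt_4o", "category_1_claude", "category_1_gemini"]

# Per-response agreement: the share of models matching the most common vote
results["agreement"] = results[model_cols].apply(
    lambda row: row.value_counts(normalize=True).max(), axis=1
)

# Flag every response where the models did not fully agree for manual review
needs_review = results[results["agreement"] != 1.0]

# Pairwise agreement rates between models (a simple inter-model reliability check)
for a, b in combinations(model_cols, 2):
    rate = (results[a] == results[b]).mean()
    print(f"{a} vs {b}: {rate:.0%} agreement")
</code></pre></div></div>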

<h2 id="practical-considerations">Practical Considerations</h2>

<p><strong>Cost:</strong> Ensemble classification multiplies your API costs by the number of models. Three models means roughly 3x the cost. For large datasets, consider running ensembles on a sample first to calibrate, then using a single model for the full dataset.</p>
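
<p>As a rough sketch of that calibration step: run the ensemble on a random sample, then check how often your preferred single model matches the consensus. The call pattern reuses the examples above; the Anthropic and Google model identifiers and the API keys below are placeholders.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Calibration sketch: ensemble on a sample, then decide whether one model is enough.
# Model identifiers and API keys are placeholders.
sample = df.sample(n=200, random_state=42)

sample_results = cat.classify(
    input_data=sample['responses'],
    categories=["Category A", "Category B", "Category C"],
    models=[
        ("gpt-4o", "openai", "sk-..."),
        ("claude-sonnet", "anthropic", "sk-ant-..."),
        ("gemini-flash", "google", "AIza..."),
    ],
    consensus_threshold="majority"
)

# Share of sampled responses where the single cheapest model matches the consensus;
# if this is high enough for your purposes, run that model alone on the full data.
match_rate = (sample_results['category_1_gpt_4o'] == sample_results['category_1_consensus']).mean()
print(f"GPT-4o matches the consensus on {match_rate:.0%} of the sample")
</code></pre></div></div>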

<p><strong>Speed:</strong> Models are called in parallel, so wall-clock time doesn’t increase linearly. Three models typically take only slightly longer than one.</p>

<p><strong>When to use ensembles:</strong> Ensembles are most valuable when classification decisions are consequential—when they feed into regression models, when you’re publishing findings, or when categories are subjective enough that reasonable people might disagree.</p>

<p><strong>When a single model suffices:</strong> For exploratory analysis, prototyping, or cases where categories are unambiguous, a single model is faster and cheaper.</p>

<h2 id="try-it">Try It</h2>

<p>Ensemble classification is available in CatLLM 0.1.16+ and in the <a href="https://huggingface.co/spaces/CatLLM/survey-classifier">web app</a>. In the web app, select “Model Comparison” to see each model’s output side-by-side, or “Ensemble” to get majority-vote consensus classifications.</p>

<h3 id="links">Links</h3>

<ul>
  <li><strong>Web App:</strong> <a href="https://huggingface.co/spaces/CatLLM/survey-classifier">https://huggingface.co/spaces/CatLLM/survey-classifier</a></li>
  <li><strong>Python Package:</strong> <a href="https://pypi.org/project/cat-llm/">https://pypi.org/project/cat-llm/</a></li>
  <li><strong>GitHub:</strong> <a href="https://github.com/chrissoria/cat-llm">https://github.com/chrissoria/cat-llm</a></li>
</ul>

<p>If you have questions or want to discuss ensemble methods for your research, reach out at <a href="mailto:ChrisSoria@Berkeley.edu">ChrisSoria@Berkeley.edu</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="CatLLM" /><category term="Ensemble Methods" /><category term="Large Language Models" /><category term="Survey Data" /><category term="Classification" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">CatLLM is Now a Web App</title><link href="https://christophersoria.com/posts/2026/01/catllm-web-app/" rel="alternate" type="text/html" title="CatLLM is Now a Web App" /><published>2026-01-06T00:00:00-08:00</published><updated>2026-01-06T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2026/01/catllm-web-app</id><content type="html" xml:base="https://christophersoria.com/posts/2026/01/catllm-web-app/"><![CDATA[<p><img src="/images/catllm_bw_banner.png" alt="CatLLM" /></p>

<p>I’ve been working on <a href="https://pypi.org/project/cat-llm/">CatLLM</a>, a Python package for classifying open-ended survey responses with LLMs. This week I converted it into a web app.</p>

<p><strong>Try it here:</strong> <a href="https://huggingface.co/spaces/chrissoria/CatLLM">https://huggingface.co/spaces/chrissoria/CatLLM</a></p>

<h2 id="the-problem">The Problem</h2>

<p>If you’ve worked with open-ended survey data, you know the workflow: hundreds or thousands of free-text responses that need to be categorized before you can do any quantitative analysis. The traditional approach is manual coding—either doing it yourself or hiring RAs. It’s slow, expensive, and doesn’t scale.</p>

<h2 id="what-the-app-does">What the App Does</h2>

<p>The web app lets you classify survey responses without writing any code:</p>

<ol>
  <li><strong>Upload your data</strong> — CSV, Excel, or PDF documents</li>
  <li><strong>Define categories</strong> — Either specify your own categories or let the model extract them from your data automatically</li>
  <li><strong>Run classification</strong> — The model assigns each response to one or more categories (multi-label classification)</li>
  <li><strong>Download results</strong> — Get a CSV with classifications plus a methodology write-up you can adapt for your paper</li>
</ol>

<p>The same functionality is available in the Python package if you prefer working in code, but the web app removes the setup barrier for researchers who just want to try it out.</p>
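
<p>For reference, a rough package equivalent of the web app workflow looks like the sketch below. The <code class="language-plaintext highlighter-rouge">multi_class</code> call mirrors the examples elsewhere on this site; the file path, model name, provider string, and categories are placeholders, so check the documentation for the exact options.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rough package equivalent of the web app steps (placeholders throughout).
import catllm as cat
import pandas as pd

df = pd.read_csv("responses.csv")  # placeholder path to your free-text responses

results = cat.multi_class(
    survey_input=df['responses'],
    categories=["Category 1", "Category 2", "Category 3"],
    api_key="your-api-key",
    user_model="gpt-4o-mini",   # placeholder model; provider string below is an assumption
    model_source="openai",
    creativity=0
)

# Assuming a tabular return as in the package's other examples,
# saving to CSV plays the role of the app's download step.
results.to_csv("classified_responses.csv", index=False)
</code></pre></div></div>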

<h2 id="free-models-for-now">Free Models (For Now)</h2>

<p>I’m covering the API costs temporarily so people can test it. The free tier currently includes:</p>

<ul>
  <li>GPT-4o-mini (OpenAI)</li>
  <li>Claude 3 Haiku (Anthropic)</li>
  <li>Gemini 2.5 Flash (Google)</li>
  <li>Llama 3.3 70B (via Groq)</li>
  <li>DeepSeek V3.1</li>
  <li>Qwen3 235B</li>
  <li>Mistral Medium</li>
  <li>Grok 4 Fast</li>
</ul>

<p>If you need more powerful models or have large-scale needs, you can bring your own API key.</p>

<h2 id="looking-for-feedback">Looking for Feedback</h2>

<p>This is still early. I’d appreciate it if you:</p>

<ul>
  <li><strong>Break it</strong> — Find edge cases, report bugs, tell me what fails</li>
  <li><strong>Suggest features</strong> — What would make this useful for your research?</li>
  <li><strong>Collaborate</strong> — If you’re interested in working on a methods paper evaluating LLM classification for survey data, I’m open to it</li>
</ul>

<p>You can reach me at <a href="mailto:ChrisSoria@Berkeley.edu">ChrisSoria@Berkeley.edu</a> or leave comments on the <a href="https://github.com/chrissoria/cat-llm">GitHub repo</a>.</p>

<h2 id="links">Links</h2>

<ul>
  <li><strong>Web App:</strong> <a href="https://huggingface.co/spaces/chrissoria/CatLLM">https://huggingface.co/spaces/chrissoria/CatLLM</a></li>
  <li><strong>Python Package:</strong> <a href="https://pypi.org/project/cat-llm/">https://pypi.org/project/cat-llm/</a></li>
  <li><strong>GitHub:</strong> <a href="https://github.com/chrissoria/cat-llm">https://github.com/chrissoria/cat-llm</a></li>
</ul>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="CatLLM" /><category term="Web App" /><category term="Large Language Models" /><category term="Survey Data" /><category term="Open Source" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">CatLLM Now Supports Huggingface: Access Thousands of Open-Source Models</title><link href="https://christophersoria.com/posts/2025/12/cat-llm-huggingface/" rel="alternate" type="text/html" title="CatLLM Now Supports Huggingface: Access Thousands of Open-Source Models" /><published>2025-12-30T00:00:00-08:00</published><updated>2025-12-30T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2025/12/cat-llm-huggingface</id><content type="html" xml:base="https://christophersoria.com/posts/2025/12/cat-llm-huggingface/"><![CDATA[<p><img src="/images/catllm_research.png" alt="CatLLM" /></p>

<p><a href="https://pypi.org/project/cat-llm/"><strong>CatLLM</strong></a> now works with Huggingface. This means you can use open-weight and open-source models without needing the compute power to run them locally. Just get your Huggingface API key (I recommend the Pro subscription for heavy usage—it’s only $9/month) and you’ll have access to models like Qwen, DeepSeek, and Llama. This is useful for researchers who want to test open-weight models or take advantage of their lower cost for classification tasks.</p>

<p>Another benefit is access to thousands of user-trained models for specific tasks. For example:</p>

<ul>
  <li>
    <p><strong>MedAlpaca-7B (Medical Domain)</strong> - <code class="language-plaintext highlighter-rouge">medalpaca/medalpaca-7b</code> - A 7-billion parameter LLM specifically fine-tuned for medical domain tasks, built on top of the LLaMA architecture. It’s designed to improve question-answering and medical dialogue capabilities.</p>
  </li>
  <li>
    <p><strong>CodeLlama-7B (Code Generation &amp; Understanding)</strong> - <code class="language-plaintext highlighter-rouge">codellama/CodeLlama-7b-hf</code> - Meta’s specialized code-focused LLM, available in 7B, 13B, 34B, and 70B parameter versions. It comes in three variants: base (general code), Python-specific, and Instruct (for code assistance).</p>
  </li>
  <li>
    <p><strong>Aya-23-8B (Multilingual)</strong> - <code class="language-plaintext highlighter-rouge">CohereLabs/aya-23-8B</code> - Developed by Cohere Labs, Aya 23 is an instruction-tuned model supporting 23 languages including Arabic, Chinese, French, German, Hindi, Japanese, Korean, Spanish, and more.</p>
  </li>
</ul>

<p>You can also train and host your own models. Hugging Face provides several tools for fine-tuning:</p>

<ul>
  <li><strong>Transformers Library</strong> - Use the Trainer API for fine-tuning any model from the Hub</li>
  <li><strong>PEFT (Parameter-Efficient Fine-Tuning)</strong> - Techniques like LoRA and QLoRA for efficient fine-tuning with less compute (see the sketch after this list)</li>
  <li><strong>AutoTrain</strong> - A no-code/low-code solution for fine-tuning models directly on Hugging Face</li>
  <li><strong>TRL (Transformer Reinforcement Learning)</strong> - For RLHF and preference tuning</li>
</ul>
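
<p>As a rough illustration of the PEFT route, here’s a minimal LoRA sketch. The base model, target modules, and hyperparameters are placeholders for your own task, and you would still supply a training loop (for example <code class="language-plaintext highlighter-rouge">transformers.Trainer</code> or <code class="language-plaintext highlighter-rouge">trl.SFTTrainer</code>).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal LoRA sketch with transformers + peft (placeholders; adapt to your task).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM on the Hub works

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA trains small low-rank adapter matrices instead of updating all weights
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; adjust per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# ...train with transformers.Trainer or trl.SFTTrainer on your labeled examples...
</code></pre></div></div>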

<p>You can upload and host your models in several ways:</p>

<ul>
  <li><strong>Web Interface</strong> - Go to huggingface.co/new, then use “Add File” → “Upload File”</li>
  <li><strong>Python Libraries</strong> - Use <code class="language-plaintext highlighter-rouge">model.push_to_hub("your-username/model-name")</code> with Transformers or the huggingface_hub library</li>
  <li><strong>Git</strong> - Since repos are Git-based, you can push directly via command line</li>
</ul>

<p>Your model doesn’t need to be compatible with Transformers—any custom model works.</p>
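
<p>A short sketch of those upload options, assuming you already have a fine-tuned <code class="language-plaintext highlighter-rouge">model</code> and <code class="language-plaintext highlighter-rouge">tokenizer</code> in memory (for example from the LoRA sketch above); repo names and file paths are placeholders.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Uploading to the Hub: push_to_hub for transformers/peft models,
# or the huggingface_hub API for arbitrary files. Names are placeholders.
from huggingface_hub import HfApi, login

login(token="hf_...")  # or set the HF_TOKEN environment variable

# Route 1: a transformers/peft model and tokenizer already in memory
model.push_to_hub("your-username/your-model-name")
tokenizer.push_to_hub("your-username/your-model-name")

# Route 2: any custom file, regardless of framework
api = HfApi()
api.create_repo("your-username/your-model-name", exist_ok=True)
api.upload_file(
    path_or_fileobj="model.safetensors",
    path_in_repo="model.safetensors",
    repo_id="your-username/your-model-name",
)
</code></pre></div></div>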

<p>Here are a few examples of what you could do:</p>

<ol>
  <li>
    <p><strong>Region-specific language models</strong> - Fine-tune a model specifically for extracting information from Spanish-speaking respondents from a particular country, rather than Spanish generally. For example, a model trained on Dominican or Puerto Rican Spanish would better understand the distinct vocabulary, slang, and expressions that differ significantly from Mexican Spanish.</p>
  </li>
  <li>
    <p><strong>Specialized scoring models</strong> - Train a model specifically for detecting the quality of drawn shapes for cognitive impairment assessment. Instead of relying on general-purpose vision models, you could create one optimized for CERAD-style scoring tasks.</p>
  </li>
  <li>
    <p><strong>Domain-specific extraction models</strong> - Build a model designed to extract key details from long texts in your field—such as one trained to pull specific policy details from local city council meeting transcripts, or one that identifies funding amounts and grant recipients from foundation reports.</p>
  </li>
</ol>

<h3 id="getting-started">Getting Started</h3>

<p>Using Huggingface with CatLLM is straightforward. Simply specify <code class="language-plaintext highlighter-rouge">model_source="huggingface"</code> and provide your Huggingface API key:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">multi_class</span><span class="p">(</span>
    <span class="n">survey_input</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'responses'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span><span class="s">"Category 1"</span><span class="p">,</span> <span class="s">"Category 2"</span><span class="p">,</span> <span class="s">"Category 3"</span><span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-huggingface-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"Qwen/Qwen3-VL-235B-A22B-Instruct:novita"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">chain_of_thought</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here’s an example using CodeLlama to analyze code snippets for specific features:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="c1"># Analyze code snippets for security and quality features
</span><span class="n">code_analysis</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">multi_class</span><span class="p">(</span>
    <span class="n">survey_input</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'code_snippets'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Contains SQL queries"</span><span class="p">,</span>
        <span class="s">"Has proper error handling"</span><span class="p">,</span>
        <span class="s">"Uses deprecated functions"</span><span class="p">,</span>
        <span class="s">"Contains hardcoded credentials"</span><span class="p">,</span>
        <span class="s">"Implements input validation"</span>
    <span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-huggingface-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"codellama/CodeLlama-7b-Instruct-hf"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">chain_of_thought</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here’s an example using Aya-23 to classify Spanish survey responses with categories written in Spanish:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="c1"># Classify Spanish survey responses about healthcare access
</span><span class="n">healthcare_analysis</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">multi_class</span><span class="p">(</span>
    <span class="n">survey_input</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'respuestas_encuesta'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Este participante contestó que tiene acceso a seguro médico"</span><span class="p">,</span>
        <span class="s">"Este participante mencionó barreras financieras"</span><span class="p">,</span>
        <span class="s">"Este participante expresó dificultades con el idioma"</span><span class="p">,</span>
        <span class="s">"Este participante indicó satisfacción con su atención médica"</span><span class="p">,</span>
        <span class="s">"Este participante reportó largos tiempos de espera"</span>
    <span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-huggingface-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"CohereLabs/aya-23-8B"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">chain_of_thought</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p>Here’s an example using MedAlpaca to classify medical interview notes:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">catllm</span> <span class="k">as</span> <span class="n">cat</span>

<span class="c1"># Classify patient interview notes for symptoms and conditions
</span><span class="n">medical_analysis</span> <span class="o">=</span> <span class="n">cat</span><span class="p">.</span><span class="n">multi_class</span><span class="p">(</span>
    <span class="n">survey_input</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s">'patient_notes'</span><span class="p">],</span>
    <span class="n">categories</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Patient reports cardiovascular symptoms"</span><span class="p">,</span>
        <span class="s">"Patient mentions respiratory issues"</span><span class="p">,</span>
        <span class="s">"Patient describes chronic pain"</span><span class="p">,</span>
        <span class="s">"Patient indicates mental health concerns"</span><span class="p">,</span>
        <span class="s">"Patient reports medication side effects"</span>
    <span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-huggingface-api-key"</span><span class="p">,</span>
    <span class="n">user_model</span><span class="o">=</span><span class="s">"medalpaca/medalpaca-7b"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"huggingface"</span><span class="p">,</span>
    <span class="n">creativity</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">chain_of_thought</span><span class="o">=</span><span class="bp">True</span>
<span class="p">)</span>
</code></pre></div></div>

<p><strong>Important Limitation:</strong> MedAlpaca targets medical student-level knowledge and should never be used as a substitute for professional medical advice.</p>

<h3 id="full-provider-support">Full Provider Support</h3>

<p>CatLLM now supports seven major providers:</p>

<ul>
  <li>OpenAI (GPT-4o, GPT-5)</li>
  <li>Anthropic (Claude Sonnet 4, Claude 3.5)</li>
  <li>Google (Gemini 2.5 Flash/Pro)</li>
  <li><strong>Huggingface</strong> (Qwen, Llama, DeepSeek, community models)</li>
  <li>xAI (Grok)</li>
  <li>Mistral (Mistral Large, Pixtral)</li>
  <li>Perplexity (Sonar models)</li>
</ul>

<h3 id="get-in-touch">Get in Touch</h3>

<p>If you have any questions about using Huggingface with CatLLM, or if you’d like guidance on how to fine-tune a model for your specific research needs to maximize consistency and quality of output, feel free to reach out. I’m happy to help researchers get the most out of these tools. You can contact me at <a href="mailto:ChrisSoria@Berkeley.edu">ChrisSoria@Berkeley.edu</a>.</p>

<h3 id="learn-more">Learn More</h3>

<ul>
  <li><a href="https://github.com/chrissoria/cat-llm#readme">View the documentation</a></li>
  <li><a href="https://pypi.org/project/cat-llm/">Install from PyPI</a>: <code class="language-plaintext highlighter-rouge">pip install cat-llm</code></li>
</ul>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="Huggingface" /><category term="Open-Source Models" /><category term="Large Language Models" /><category term="Python Package" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Selected as a 2025 Bashir Ahmed Graduate Fellow</title><link href="https://christophersoria.com/posts/2025/12/ahmed-fellowship/" rel="alternate" type="text/html" title="Selected as a 2025 Bashir Ahmed Graduate Fellow" /><published>2025-12-11T00:00:00-08:00</published><updated>2025-12-11T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2025/12/ahmed-fellowship</id><content type="html" xml:base="https://christophersoria.com/posts/2025/12/ahmed-fellowship/"><![CDATA[<p><img src="/images/ahmed.jpg" alt="Ahmed Fellowship" /></p>

<p>I am honored to have been selected as a 2025 awardee of the Bashir Ahmed Graduate Fellowship at UC Berkeley’s Department of Demography.</p>

<p>The Bashir Ahmed Graduate Fellowship supports dissertation research for students in the Demography PhD Program and the Sociology &amp; Demography PhD Program. This fellowship provides critical support as I continue my dissertation research at the intersection of social demography, epidemiology, and computational methods.</p>

<p>My research examines how social networks correlate with mortality outcomes at the county level, investigates how partisan social networks can worsen disease outcomes, and studies the impact of loneliness on aging and health. Beyond traditional demographic research, I also apply large language models to demographic studies, developing techniques for text classification and computational analysis of survey data.</p>

<p>What makes this recognition particularly meaningful to me is that this year the fellowship was not application-based—the faculty independently selected the recipient. I am deeply grateful to the Department of Demography and the Ahmed family for this recognition and support. This fellowship will enable me to further develop my research on population health dynamics and continue bridging computational methods with demographic inquiry.</p>

<p>You can learn more about the fellowship and past recipients on the <a href="https://www.demog.berkeley.edu/about/ahmed-fellowship/">UC Berkeley Demography website</a>.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="Fellowship" /><category term="UC Berkeley" /><category term="Demography" /><category term="Research" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">New Python Package: llm-web-research for Verified Web Research</title><link href="https://christophersoria.com/posts/2025/12/llm-web-research/" rel="alternate" type="text/html" title="New Python Package: llm-web-research for Verified Web Research" /><published>2025-12-11T00:00:00-08:00</published><updated>2025-12-11T00:00:00-08:00</updated><id>https://christophersoria.com/posts/2025/12/llm-web-research</id><content type="html" xml:base="https://christophersoria.com/posts/2025/12/llm-web-research/"><![CDATA[<p><img src="/images/logo_llm_researcher.png" alt="llm-web-research" /></p>

<p>I’m excited to announce the release of a new Python package: <a href="https://github.com/chrissoria/llm-web-research"><strong>llm-web-research</strong></a>. Part of the <a href="https://pypi.org/project/cat-llm/">CatLLM</a> ecosystem, this tool enables LLM-powered web research with a focus on accuracy over quantity—designed for researchers who need verified, high-quality data.</p>

<h2 id="the-problem">The Problem</h2>

<p>When using LLMs for web research, a common issue is ambiguity. Searching for information about “John Smith” or “Springfield” can return incorrect results simply because many different people and places share those names. For research applications where false positives are costly, we need a more rigorous approach.</p>

<h2 id="the-solution-multi-step-verification">The Solution: Multi-Step Verification</h2>

<p>The core innovation of llm-web-research is a 4-step verification pipeline that catches ambiguous queries before returning potentially incorrect answers:</p>

<ol>
  <li><strong>Information Gathering</strong> — Initial web search to understand the entity and context</li>
  <li><strong>Ambiguity Detection</strong> — Explicit checks for name conflicts, common names, and contradictions</li>
  <li><strong>Skeptical Verification</strong> — Secondary search actively looking for contradicting information</li>
  <li><strong>Structured Output</strong> — JSON formatting with binary confidence scoring</li>
</ol>

<p>The design philosophy is simple: <strong>no answer is better than a wrong answer.</strong></p>

<h2 id="key-features">Key Features</h2>

<ul>
  <li><strong>Two research modes:</strong> <code class="language-plaintext highlighter-rouge">precise_web_research()</code> for maximum accuracy, <code class="language-plaintext highlighter-rouge">web_research()</code> for faster single-step searches</li>
  <li><strong>Multi-provider support:</strong> Anthropic, Google Gemini, Perplexity</li>
  <li><strong>Structured output:</strong> Returns pandas DataFrames with answers and source URLs</li>
  <li><strong>Safety features:</strong> Incremental CSV saving for long-running searches, automatic “Information unclear” responses when uncertain</li>
</ul>

<h2 id="installation-and-usage">Installation and Usage</h2>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>llm-web-research
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">llm_web_research</span> <span class="k">as</span> <span class="n">lwr</span>

<span class="n">results</span> <span class="o">=</span> <span class="n">lwr</span><span class="p">.</span><span class="n">precise_web_research</span><span class="p">(</span>
    <span class="n">search_question</span><span class="o">=</span><span class="s">"founding year"</span><span class="p">,</span>
    <span class="n">search_input</span><span class="o">=</span><span class="p">[</span><span class="s">"Apple"</span><span class="p">,</span> <span class="s">"Microsoft"</span><span class="p">],</span>
    <span class="n">api_key</span><span class="o">=</span><span class="s">"your-api-key"</span><span class="p">,</span>
    <span class="n">model_source</span><span class="o">=</span><span class="s">"anthropic"</span>
<span class="p">)</span>
</code></pre></div></div>
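
<p>For cases where speed matters more than the full verification pipeline, the faster single-step mode can be called in the same way. The snippet below assumes <code class="language-plaintext highlighter-rouge">web_research()</code> shares the same core parameters as <code class="language-plaintext highlighter-rouge">precise_web_research()</code>, so check the documentation for the exact signature.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Faster single-step mode (parameters assumed to mirror precise_web_research;
# verify against the package documentation before relying on this).
quick_results = lwr.web_research(
    search_question="founding year",
    search_input=["Apple", "Microsoft"],
    api_key="your-api-key",
    model_source="anthropic"
)
</code></pre></div></div>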

<h2 id="use-cases">Use Cases</h2>

<ul>
  <li>Academic research requiring verified sources</li>
  <li>Fact-checking with high accuracy requirements</li>
  <li>Building high-quality datasets</li>
  <li>Automated due diligence tasks</li>
</ul>

<p>Check out the <a href="https://github.com/chrissoria/llm-web-research">GitHub repository</a> for full documentation and examples, or install directly from <a href="https://pypi.org/project/llm-web-research/">PyPI</a>.</p>

<hr />

<h2 id="acknowledgments">Acknowledgments</h2>

<p>This work was supported by the <a href="https://www.demog.berkeley.edu/about/ahmed-fellowship/">Bashir Ahmed Graduate Fellowship</a> at UC Berkeley’s Department of Demography. I am grateful to the Ahmed family and the Department of Demography for their support of my research.</p>]]></content><author><name>Contact Information</name><email>chrissoria@berkeley.edu</email></author><category term="Large Language Models" /><category term="Python Package" /><category term="Web Research" /><category term="Data Collection" /><summary type="html"><![CDATA[]]></summary></entry></feed>