How to Improve Match Quality on String Data Using Large Language Models


In the world of data analysis, ensuring the accuracy and consistency of datasets is crucial, especially when dealing with entities like school names that may be spelled differently across various sources. This discrepancy can pose significant challenges when trying to match records from different datasets. Traditional methods of data cleaning may fall short in addressing these inconsistencies effectively.

In this blog post, we explore an innovative approach to improving match quality between datasets using Large Language Models (LLMs) and the Jaro-Winkler similarity algorithm. By leveraging the power of LLMs, we can “force” both datasets to align more closely in their naming conventions, enhancing the probability of finding accurate matches. The Jaro-Winkler score then helps us identify the highest probability matches, ensuring that our data integration efforts are both efficient and reliable.

Follow along below or download the Python notebook here: https://github.com/chrissoria/chrissoria.github.io/blob/main/files/clean_schools.ipynb

Overall Summary of the Post

  • I take two datasets and push the match rate from 23.83% to 67.16% using generative AI.
  • This post was written by GPT-4.

Getting set up

Before diving into the data processing, it’s essential to set up the environment with the necessary libraries and configurations. Here’s a breakdown of the initial steps:

from openai import OpenAI
import pandas as pd
import jellyfish
import numpy as np
import json
import os
import re
import time
from dotenv import load_dotenv, find_dotenv

To keep sensitive information like API keys secure, we use environment variables. The dotenv library helps us load these variables from a local .env file:

load_dotenv()
True

This command initializes the environment variables, making them accessible within the script. It ensures that API keys and other sensitive data are not hard-coded into the script. The script also specifies different models (gpt-3.5-turbo-16k, gpt-3.5-turbo, gpt-4o) for potential use, each offering different capabilities and token limits.
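For reference, the .env file is just a plain-text file in the project root that load_dotenv picks up. A minimal sketch (the key below is a placeholder; never commit a real key to version control):

OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxx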

_ = load_dotenv(find_dotenv()) # read local .env file

#openai.api_key = os.getenv("api.txt")
LONG_MODEL = "gpt-3.5-turbo-16k"
GPT_3 = "gpt-3.5-turbo"
GPT_4 = "gpt-4o"

To ensure that the script operates in the correct context, we set the working directory to the location of our datasets:

os.chdir('/Users/chrissoria/Documents/Research/determinants-grad-adm')
current_directory = os.getcwd()
print(current_directory)
/Users/chrissoria/Documents/Research/determinants-grad-adm

Cleaning School Names

To begin, we define a function to clean the school names by standardizing their format. This involves converting text to lowercase, removing unwanted characters, and trimming whitespace. This function does the following:

  • Lowercasing: Converts all text to lowercase to ensure case-insensitive comparisons.
  • Removing Punctuation: Eliminates commas and hyphens to reduce variability in naming.
  • Trimming Whitespace: Strips leading and trailing spaces and replaces multiple spaces with a single space.
  • Removing Articles: Removes the word “the” from the start of names to focus on the core name.

Here’s the function:

def clean_school_names(df, column_name):
    df[column_name] = df[column_name].str.lower()  # lowercase for case-insensitive comparison
    df[column_name] = df[column_name].str.replace(r'[,-]', '', regex=True)  # drop commas and hyphens
    df[column_name] = df[column_name].str.strip()  # trim leading/trailing whitespace
    df[column_name] = df[column_name].str.replace(r'\s+', ' ', regex=True)  # collapse repeated spaces
    df[column_name] = df[column_name].str.replace(r'^the\s+', '', case=False, regex=True)  # drop leading "the"
    return df
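As a quick sanity check, here is what the function does to a couple of sample entries (a toy example, not part of the original notebook):

demo = pd.DataFrame({'school': ['  The University of California, Berkeley  ', 'Technion-Israel Inst of Tech']})
demo = clean_school_names(demo, 'school')
print(demo['school'].tolist())
# ['university of california berkeley', 'technionisrael inst of tech']

Note that hyphens are deleted rather than replaced with a space, which is why “Technion-Israel” collapses to “technionisrael”; that quirk shows up again in the match results below.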

Next, we load the Berkeley schools dataset, select the relevant column, and apply our cleaning function.

  • Loading Data: Reads the Excel file containing school names.
  • Selecting Columns: Focuses on the column with undergraduate degree school names.
  • Renaming Columns: Renames the column for consistency in further processing.

After cleaning, we ensure that each school name is unique:

berkeley_schools = pd.read_excel('data/berkeley_schools.xlsx')
berkeley_schools = berkeley_schools[['UG Degree School']]
berkeley_schools = berkeley_schools.rename(columns = {'UG Degree School' : 'school'})
print(len(berkeley_schools))
berkeley_schools.head()
158007
                                  school
0                 University of Kentucky
1           Technion-Israel Inst of Tech
2  University of Minnesota, Twin Cities
3                        Univ of Nairobi
4      University of Southern California

Next, we apply the cleaning function to the Berkeley school names and keep only the unique, non-missing entries:

berkeley_schools = clean_school_names(berkeley_schools, 'school')

berkeley_schools = pd.DataFrame(berkeley_schools['school'].dropna().unique(), columns=['school'])

print(len(berkeley_schools))
berkeley_schools.head()
5166
                                school
0               university of kentucky
1          technionisrael inst of tech
2  university of minnesota twin cities
3                      univ of nairobi
4    university of southern california

Preparing and Matching the IPEDS Schools Dataset

Continuing with our data preparation, we now focus on the IPEDS schools dataset. This dataset contains a comprehensive list of US institutions, which we will clean and prepare for matching with the Berkeley schools dataset.

First, we load the dataset and select the relevant column containing school names:

iped_schools = pd.read_csv('data/IPEDS_schools.csv')
iped_schools = iped_schools[['INSTNM']]
iped_schools = iped_schools.rename(columns = {'INSTNM' : 'school'})
print(len(iped_schools))
iped_schools.head()
/var/folders/89/6bdxzk2j30v5n3wstywbcpg80000gn/T/ipykernel_17234/287998861.py:1: DtypeWarning: Columns (9,1537,1540,1542, ...) have mixed types. Specify dtype option on import or set low_memory=False.
  iped_schools = pd.read_csv('data/IPEDS_schools.csv')


6543
                                school
0             Alabama A & M University
1  University of Alabama at Birmingham
2                   Amridge University
3  University of Alabama in Huntsville
4             Alabama State University
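As an aside, the DtypeWarning above comes from pandas type-guessing across a very wide file. Since we only need the institution-name column, the warning can be avoided (and the read sped up) by loading just that column, e.g.:

iped_schools = pd.read_csv('data/IPEDS_schools.csv', usecols=['INSTNM'])

Passing low_memory=False, as the warning itself suggests, works too.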

We apply the same cleaning function to standardize the school names in the IPEDS dataset:

iped_schools = clean_school_names(iped_schools, 'school')

iped_schools = pd.DataFrame(iped_schools['school'].dropna().unique(), columns=['school'])

# for assessing the match rate, I'll add a constant 1 to the IPEDS df
iped_schools['match'] = 1

print(len(iped_schools))
iped_schools.head()
6405
                                school  match
0             alabama a & m university      1
1  university of alabama at birmingham      1
2                   amridge university      1
3  university of alabama in huntsville      1
4             alabama state university      1

With both datasets cleaned and prepared, we can now attempt to match them based on the school names. There are 5,166 unique schools in the cleaned Berkeley list and 6,405 in the cleaned IPEDS data. Our goal is to maximize the match rate, with the highest possible match being 5,166.

highest_possible_match = len(berkeley_schools)

We perform a left join to match the datasets based on the school column:

merged_1 = berkeley_schools.merge(iped_schools, on= 'school', how = 'left')
merged_1.head()
                                school  match
0               university of kentucky    1.0
1          technionisrael inst of tech    NaN
2  university of minnesota twin cities    NaN
3                      univ of nairobi    NaN
4    university of southern california    1.0
print(f"The match rate with standardizing strings in both columns and nothing more is {merged_1['match'].sum() / highest_possible_match * 100:.2f}%")

After standardizing the school names in both datasets, we achieved a match rate of 23.83%. While this is a good start, we can further improve the match rate by employing fuzzy matching, which tolerates minor spelling differences; pairing it with a strict similarity threshold guards against false matches between schools with merely similar names.

Applying the Jaro-Winkler Algorithm

To enhance our matching process, we utilize the Jaro-Winkler similarity algorithm. This algorithm is particularly effective for comparing strings and identifying matches based on a similarity score. Here’s how we implement it:

def find_best_match(school, choices, threshold=0.975):
    # Scan all candidate names and keep the one with the highest Jaro-Winkler score
    best_match = None
    highest_score = 0
    for choice in choices:
        # note: jellyfish 1.0+ renames this function to jaro_winkler_similarity
        score = jellyfish.jaro_winkler(school, choice)
        if score > highest_score:
            best_match = choice
            highest_score = score
    # Only report a match if it clears the similarity threshold
    if highest_score >= threshold:
        return best_match, highest_score
    else:
        return None, highest_score

# Apply the matching function to each school name in berkeley_schools
matches = berkeley_schools['school'].apply(lambda x: find_best_match(x, iped_schools['school']))

# Create new columns for the best match and match score
berkeley_schools['Best Match'] = matches.apply(lambda x: x[0])
berkeley_schools['Match Score'] = matches.apply(lambda x: x[1])

berkeley_schools.head()
                                school                           Best Match  Match Score
0               university of kentucky               university of kentucky     1.000000
1          technionisrael inst of tech                                 None     0.828889
2  university of minnesota twin cities  university of minnesotatwin cities      0.982521
3                      univ of nairobi                                 None     0.857179
4    university of southern california    university of southern california     1.000000
  • Function Definition: The find_best_match function iterates through possible matches and calculates the Jaro-Winkler score for each.
  • Threshold Setting: We set a high threshold (0.975) to ensure that only highly similar matches are considered valid (see the illustration after this list).
  • Applying the Function: We apply the function to each school name in the Berkeley dataset to find the best match in the IPEDS dataset.
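To build intuition for that threshold, here is a minimal, hand-picked illustration (not from the original notebook; jellyfish 1.0+ renames the function to jaro_winkler_similarity, and exact scores may vary slightly by version):

print(jellyfish.jaro_winkler('university of kentucky', 'university of kentucky'))  # 1.0 -- identical strings
print(jellyfish.jaro_winkler('univ of nairobi', 'university of nairobi'))          # roughly 0.87 -- similar, but rejected at 0.975

A threshold this strict trades recall for precision: abbreviations like “univ” still fail, which is exactly the gap the LLM standardization step below is meant to close.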

The results of the fuzzy matching process are promising. Here’s a sample of the output:

filtered_matches = berkeley_schools[berkeley_schools['Match Score'] >= 0.975]

merged_2 = filtered_matches.merge(iped_schools, left_on='Best Match', right_on='school', suffixes=('_left', '_right'), how='left')
merged_2.head()

  • Perfect Matches: Some schools, like “University of Kentucky” and “University of Southern California,” achieve a perfect match score of 1.0.
  • Partial Matches: For schools with lower scores, such as “Technion-Israel Inst of Tech,” the algorithm did not find a sufficiently similar match.

                           school_left                           Best Match  Match Score                         school_right  match
0               university of kentucky               university of kentucky     1.000000               university of kentucky      1
1  university of minnesota twin cities  university of minnesotatwin cities      0.982521  university of minnesotatwin cities       1
2    university of southern california    university of southern california     1.000000    university of southern california      1
3                     brown university                     brown university     1.000000                     brown university      1
4             colorado school of mines             colorado school of mines     1.000000             colorado school of mines      1
print(f"The match rate with standardizing strings and a jaro-winkler threshold of .975 is {merged_2['match'].sum() / highest_possible_match * 100:.2f}%")
The match rate with standardizing strings and a jaro-winkler threshold of .975 is 26.93%

Enhancing Data Matching with AI Assistance

While our initial fuzzy matching using the Jaro-Winkler algorithm improved the match rate to 26.93%, we encountered a challenge: some entries in the Berkeley list are either international schools or not schools at all. To address this, we leverage the power of GPT to identify and filter valid domestic schools.

We use GPT to determine which entries in the Berkeley dataset are valid schools and whether they are located in the United States. This step helps us focus our matching efforts on relevant entries, potentially improving the match rate.

def identify_valid_schools(school_list, user_model):

    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    standardized_schools = []

    example_json = """{"country": "United States"}"""

    for school in school_list:
        prompt = f"""Tell me which country the college in triple backticks is located in: ```{school}```. \
        Put your response in JSON format with 'country' as the key and your output as the value. \
        If it's not a valid school make the value be 'invalid'. \
        Here's an example of what the JSON should look like: {example_json}"""
        try:
            response = client.chat.completions.create(
                model=user_model,
                response_format={"type": "json_object"},
                messages=[
                    {
                      "role": "system",
                      "content": "You provide direct and concise responses and provide only the answer to the question asked and provide only the requested JSON and nothing more."
                    },
                    {'role': 'user',
                     'content': prompt}
                ],
                temperature=0  # deterministic output for reproducibility
            )

            standardized_school = response.choices[0].message.content
            standardized_schools.append(standardized_school)
            print(f"Processing row {school}")
        except Exception as e:
            print(f"An error occurred: {e}")
            # Append valid JSON so the json.loads step below doesn't fail on error rows
            standardized_schools.append('{"country": "error"}')

    # Parse each JSON response and collect the 'country' values into a Series
    data = []

    for item in standardized_schools:
        parsed_json = json.loads(item)
        data.append(parsed_json)

    df = pd.DataFrame(data)
    standardized_schools = df['country']

    return standardized_schools
  • Functionality: This function queries GPT to identify the country of each school or mark it as invalid if it’s not a recognized institution.
  • JSON Response: The response is formatted as JSON, making it easy to parse and use in further analysis.

We apply the identify_valid_schools function to the Berkeley dataset, storing each school’s country (or 'invalid') in a new column:

berkeley_schools['country'] = identify_valid_schools(berkeley_schools['school'],
                                                          GPT_4)

To improve the accuracy of our data matching, we add a domestic indicator to the Berkeley schools dataset: a new column, domestic, populated from the country information GPT returned. This lets us focus on schools located in the United States, the only ones that can appear in IPEDS:

berkeley_schools['domestic'] = berkeley_schools['country'].apply(lambda x: 1 if 'United States' in x else 0)
berkeley_schools.to_csv('data/berkeley_schools_features.csv')
berkeley_schools

  • Domestic Indicator: The domestic column is set to 1 for schools in the United States and 0 for others.
  • Data Export: The updated dataset is saved for future reference and analysis.

                                                 school                           Best Match  Match Score        country  domestic
0                                university of kentucky               university of kentucky     1.000000  United States         1
1                           technionisrael inst of tech                                 None     0.828889         Israel         0
2                  university of minnesota twin cities   university of minnesotatwin cities     0.982521  United States         1
3                                       univ of nairobi                                 None     0.857179          Kenya         0
4                     university of southern california    university of southern california     1.000000  United States         1
...                                                 ...                                  ...          ...            ...       ...
5161  pec university of technology (deemed to be uni...                                 None     0.736349          India         0
5162                                  louisiana college                                 None     0.878834  United States         1
5163                                              niger                                 None     0.711111        invalid         0
5164                       university of nebraska omaha                                 None     0.966359  United States         1
5165   centro universitario catolico salesiano auxilium                                 None     0.841793         Brazil         0

5166 rows × 5 columns

Filtering and Merging for Improved Matches

With the domestic indicator in place, we filter the dataset to focus on high-confidence matches and merge it with the IPEDS dataset:

filtered_matches_domestic = berkeley_schools[berkeley_schools['Match Score'] >= 0.975]

merged_2 = filtered_matches_domestic.merge(iped_schools, left_on='Best Match', right_on='school', suffixes=('_left', '_right'), how='left')
merged_2.head()
                           school_left                           Best Match  Match Score                         school_right  match
0               university of kentucky               university of kentucky     1.000000               university of kentucky      1
1  university of minnesota twin cities  university of minnesotatwin cities      0.982521  university of minnesotatwin cities       1
2    university of southern california    university of southern california     1.000000    university of southern california      1
3                     brown university                     brown university     1.000000                     brown university      1
4             colorado school of mines             colorado school of mines     1.000000             colorado school of mines      1
  • Filtering Matches: We focus on matches with a Jaro-Winkler score of 0.975 or higher to ensure high confidence (a variant that also uses the domestic flag is sketched after this list).
  • Merging Datasets: The filtered matches are merged with the IPEDS dataset to verify and enhance the match rate.
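Note that the snippet above filters on Match Score only. If we also wanted to drop international rows at this stage, the domestic flag could be added to the mask; a minimal sketch (the high_conf_domestic name is illustrative, not from the original notebook):

high_conf_domestic = berkeley_schools[
    (berkeley_schools['Match Score'] >= 0.975) &
    (berkeley_schools['domestic'] == 1)
]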

Further Standardizing School Names with LLM

To achieve even greater accuracy, we use a Large Language Model (LLM) to standardize the spelling of school names. This step helps eliminate minor discrepancies that could affect the match rate:

def extract_standard(school_list, user_model):

    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

    standardized_schools = []

    for school in school_list:
        prompt = f"""Please provide me with the correct and standard way of spelling the following college in triple backticks: ```{school}``` \
        put your response in JSON format with 'school' as the key and your output as the value."""
        try:
            response = client.chat.completions.create(
                model=user_model,
                response_format={"type": "json_object"},
                messages=[
                    {
                      "role": "system",
                      "content": """You provide direct and concise responses and provide only the answer to the question asked. \
                      You avoid using terms like, 'the standard way to spell this school is...' and provide only the standard way to spell the school."""
                    },
                    {'role': 'user',
                     'content': prompt}
                ],
                temperature=0  # deterministic output for reproducibility
            )

            standardized_school = response.choices[0].message.content
            standardized_schools.append(standardized_school)
            print(f"Processing row {school}")
        except Exception as e:
            print(f"An error occurred: {e}")
            # Append valid JSON so the json.loads step below doesn't fail on error rows
            standardized_schools.append('{"school": "error"}')

    # Parse each JSON response and collect the 'school' values into a Series
    data = []

    for item in standardized_schools:
        parsed_json = json.loads(item)
        data.append(parsed_json)

    df = pd.DataFrame(data)
    standardized_schools = df['school']

    return standardized_schools

  • LLM Standardization: The function queries GPT to provide the standard spelling for each school name, formatted as JSON.
  • Error Handling: The function includes error handling to manage any issues during the API call (a retry sketch follows below).
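One practical note: these loops make one API call per school name (more than 11,000 calls across the two datasets), so transient rate-limit failures are likely at some point. A simple mitigation, using the time module imported at the top, is to wrap the call in a retry with exponential backoff. A sketch under those assumptions (the helper name and retry counts are illustrative, not from the original notebook):

def call_with_retries(client, max_retries=3, **kwargs):
    # Retry the chat completion a few times, doubling the wait after each failure
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
    raise RuntimeError("all retries failed")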

Leveraging GPT-3 for Enhanced Standardization

To further refine our data matching process, we utilize GPT-3.5 (gpt-3.5-turbo, the GPT_3 constant defined earlier) to standardize the spelling of school names across both datasets. This step aims to reduce discrepancies caused by variations in naming conventions and improve the overall match rate. We choose this model over GPT-4o because the task is relatively simple and the cost per call is lower.

We apply the extract_standard function to both the Berkeley and IPEDS datasets using GPT-3. This function queries GPT-3 to provide a consistent and standardized spelling for each school name:

berkeley_schools['gpt3_standardized'] = extract_standard(berkeley_schools['school'],
                                                          GPT_3)

berkeley_schools.head()
iped_schools['gpt3_standardized'] = extract_standard(iped_schools['school'],
                                                          GPT_3)

With standardized names in place, we perform another round of matching using the Jaro-Winkler algorithm, this time on the GPT-3 standardized columns.

The table below shows a sample of the results after applying GPT-3 standardization and matching:

matches_gpt = berkeley_schools['gpt3_standardized'].apply(lambda x: find_best_match(x, iped_schools['gpt3_standardized']))

berkeley_schools['Best Match GPT'] = matches_gpt.apply(lambda x: x[0])
berkeley_schools['Match Score GPT'] = matches_gpt.apply(lambda x: x[1])

berkeley_schools.head()
                                school                           Best Match  Match Score        country  domestic                        gpt3_standardized                       Best Match GPT  Match Score GPT
0               university of kentucky               university of kentucky     1.000000  United States         1                   University of Kentucky               University of Kentucky         1.000000
1          technionisrael inst of tech                                 None     0.828889         Israel         0  Technion Israel Institute of Technology                                 None         0.806527
2  university of minnesota twin cities  university of minnesotatwin cities      0.982521  United States         1     University of Minnesota Twin Cities  University of Minnesota Twin Cities         1.000000
3                      univ of nairobi                                 None     0.857179          Kenya         0                    University of Nairobi                                 None         0.931746
4    university of southern california    university of southern california     1.000000  United States         1        University of Southern California    University of Southern California         1.000000

Evaluating the True Match Rate

After leveraging GPT-3 for standardization and applying the Jaro-Winkler algorithm, we achieved a match rate of 31.32%. However, this calculation used the total number of schools in the Berkeley dataset as the denominator, which includes international entries. To obtain a more accurate match rate, we need to focus on domestic schools only.

We previously asked GPT to identify whether each school in the Berkeley dataset is located in the United States. Now, we use this information to refine our denominator and calculate the true match rate:

filtered_matches_gpt = berkeley_schools[berkeley_schools['Match Score GPT'] >= 0.975]

merged_3 = filtered_matches_gpt.merge(iped_schools, left_on='Best Match GPT', right_on='gpt3_standardized', suffixes=('_left', '_right'), how='left')
merged_3.head()
  • Domestic Focus: By using only domestic schools as the denominator, we ensure that our match rate reflects the relevant subset of data.
  • Accurate Denominator: GPT identified 2,409 domestic schools, which serves as the new denominator for calculating the match rate.
                           school_left                           Best Match  Match Score        country  domestic               gpt3_standardized_left                       Best Match GPT  Match Score GPT                         school_right  match              gpt3_standardized_right
0               university of kentucky               university of kentucky     1.000000  United States         1               University of Kentucky               University of Kentucky              1.0               university of kentucky      1               University of Kentucky
1  university of minnesota twin cities  university of minnesotatwin cities      0.982521  United States         1  University of Minnesota Twin Cities  University of Minnesota Twin Cities              1.0  university of minnesotatwin cities       1  University of Minnesota Twin Cities
2    university of southern california    university of southern california     1.000000  United States         1    University of Southern California    University of Southern California              1.0    university of southern california      1    University of Southern California
3                     brown university                     brown university     1.000000  United States         1                     Brown University                     Brown University              1.0                     brown university      1                     Brown University
4             colorado school of mines             colorado school of mines     1.000000  United States         1             Colorado School of Mines             Colorado School of Mines              1.0             colorado school of mines      1             Colorado School of Mines

Final Match Rate Calculation

First, here is the match rate after GPT-3 standardization, still using all 5,166 Berkeley schools as the denominator:

print(f"The match rate with standardizing strings and a jaro-winkler threshold of .975 is {merged_3['match'].sum() / highest_possible_match * 100:.2f}%")
The match rate with standardizing strings and a jaro-winkler threshold of .975 is 31.32%
  • Same Denominator: This 31.32% still counts every Berkeley school, including international ones; the domestic-only rate follows below.
berkeley_schools.to_csv('data/berkeley_schools_features.csv', index = False)
iped_schools.to_csv('data/IPEDS_schools_features.csv', index = False)

The match rate is still not very high. But the right-side data (IPEDS) only covers US schools, so the total school count from the left-side data (Berkeley), which includes international entries, is not an appropriate denominator. Recall that earlier we asked GPT to identify whether each school is located in the US. Let’s take only those rows where the school is domestic and use that count as the denominator to see the true match rate.

true_highest_possible_match = berkeley_schools['domestic'].sum()
print(f"GPT identified {true_highest_possible_match} domestic schools. This should be our denominator.")
GPT identified 2409 domestic schools. This should be our denominator.
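As a quick arithmetic check: 31.32% of 5,166 works out to roughly 1,618 matched schools, and 1,618 / 2,409 ≈ 67.2%, consistent with the final figure below.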

Below we find that we now reach a 67.16% match rate, thanks to GPT standardization combined with Jaro-Winkler matching:

print(f"The match rate with standardizing strings of domestic schools and a jaro-winkler threshold of .975 is {merged_3['match'].sum() / true_highest_possible_match * 100:.2f}%")
The match rate with standardizing strings of domestic schools and a jaro-winkler threshold of .975 is 67.16%