How to Improve Match Quality on String Data Using Large Language Models
In the world of data analysis, ensuring the accuracy and consistency of datasets is crucial, especially when dealing with entities like school names that may be spelled differently across various sources. This discrepancy can pose significant challenges when trying to match records from different datasets. Traditional methods of data cleaning may fall short in addressing these inconsistencies effectively.
In this blog post, we explore an innovative approach to improving match quality between datasets using Large Language Models (LLMs) and the Jaro-Winkler similarity algorithm. By leveraging the power of LLMs, we can “force” both datasets to align more closely in their naming conventions, enhancing the probability of finding accurate matches. The Jaro-Winkler score then helps us identify the highest probability matches, ensuring that our data integration efforts are both efficient and reliable.
Follow along below or download the Python notebook here: https://github.com/chrissoria/chrissoria.github.io/blob/main/files/clean_schools.ipynb
Overall Summary of the Post
- I take two datasets and push the match rate from 23.83% to 67.16% using generative AI.
- This post was written by GPT-4.
Getting set up
Before diving into the data processing, it’s essential to set up the environment with the necessary libraries and configurations. Here’s a breakdown of the initial steps:
from openai import OpenAI
import pandas as pd
import jellyfish
import numpy as np
import json
import os
import re
import time
from dotenv import load_dotenv, find_dotenv
To keep sensitive information like API keys secure, we use environment variables. The dotenv library helps us load these variables from a local .env file:
load_dotenv()
True
This command initializes the environment variables, making them accessible within the script. It ensures that API keys and other sensitive data are not hard-coded into the script. The script also specifies different models (gpt-3.5-turbo-16k, gpt-3.5-turbo, gpt-4o) for potential use, each offering different capabilities and token limits.
_ = load_dotenv(find_dotenv())  # read local .env file

LONG_MODEL = "gpt-3.5-turbo-16k"  # larger context window, if needed
GPT_3 = "gpt-3.5-turbo"
GPT_4 = "gpt-4o"
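If you're following along, the .env file is just a plain-text file in the project root holding the key that the API-calling functions below read. A minimal sketch (the key name OPENAI_API_KEY is the one this script uses later; the value shown is a placeholder):
# .env (placeholder value; never commit this file):
# OPENAI_API_KEY=sk-your-key-here
api_key = os.environ.get("OPENAI_API_KEY")  # None if the variable is missing
print(api_key is not None)  # should print True once .env is loaded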
To ensure that the script operates in the correct context, we set the working directory to the location of our datasets:
os.chdir('/Users/chrissoria/Documents/Research/determinants-grad-adm')
current_directory = os.getcwd()
print(current_directory)
/Users/chrissoria/Documents/Research/determinants-grad-adm
Cleaning School Names
To begin, we define a function to clean the school names by standardizing their format. This involves converting text to lowercase, removing unwanted characters, and trimming whitespace. This function does the following:
- Lowercasing: Converts all text to lowercase to ensure case-insensitive comparisons.
- Removing Punctuation: Eliminates commas and hyphens to reduce variability in naming.
- Trimming Whitespace: Strips leading and trailing spaces and replaces multiple spaces with a single space.
- Removing Articles: Removes the word “the” from the start of names to focus on the core name.
Here’s the function:
def clean_school_names(df, column_name):
    df[column_name] = df[column_name].str.lower()
    df[column_name] = df[column_name].str.replace(r'[,-]', '', regex=True)
    df[column_name] = df[column_name].str.strip()
    df[column_name] = df[column_name].str.replace(r'\s+', ' ', regex=True)
    df[column_name] = df[column_name].str.replace(r'^the\s+', '', case=False, regex=True)
    return df
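As a quick sanity check, here is the function applied to a toy DataFrame (a hypothetical example, not part of the original notebook; the second name reproduces the hyphen-stripping you'll see in the output below):
demo = pd.DataFrame({'school': ['The University of Kentucky ', 'Technion-Israel Inst of Tech']})
demo = clean_school_names(demo, 'school')
print(demo['school'].tolist())
# ['university of kentucky', 'technionisrael inst of tech']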
Next, we load the Berkeley schools dataset and select the relevant column:
- Loading Data: Reads the Excel file containing school names.
- Selecting Columns: Focuses on the column with undergraduate degree school names.
- Renaming Columns: Renames the column for consistency in further processing.
berkeley_schools = pd.read_excel('data/berkeley_schools.xlsx')
berkeley_schools = berkeley_schools[['UG Degree School']]
berkeley_schools = berkeley_schools.rename(columns = {'UG Degree School' : 'school'})
print(len(berkeley_schools))
berkeley_schools.head()
158007
| | school |
|---|---|
| 0 | University of Kentucky |
| 1 | Technion-Israel Inst of Tech |
| 2 | University of Minnesota, Twin Cities |
| 3 | Univ of Nairobi |
| 4 | University of Southern California |
Next, we apply the cleaning function and deduplicate, so that each school name appears only once:
berkeley_schools = clean_school_names(berkeley_schools, 'school')
berkeley_schools = pd.DataFrame(berkeley_schools['school'].dropna().unique(), columns=['school'])
print(len(berkeley_schools))
berkeley_schools.head()
5166
| | school |
|---|---|
| 0 | university of kentucky |
| 1 | technionisrael inst of tech |
| 2 | university of minnesota twin cities |
| 3 | univ of nairobi |
| 4 | university of southern california |
Preparing and Matching the IPEDS Schools Dataset
Continuing with our data preparation, we now focus on the IPEDS schools dataset. This dataset contains a comprehensive list of institutions, which we will clean and prepare for matching with the Berkeley schools dataset.
First, we load the dataset and select the relevant column containing school names:
iped_schools = pd.read_csv('data/IPEDS_schools.csv')
iped_schools = iped_schools[['INSTNM']]
iped_schools = iped_schools.rename(columns = {'INSTNM' : 'school'})
print(len(iped_schools))
iped_schools.head()
/var/folders/89/6bdxzk2j30v5n3wstywbcpg80000gn/T/ipykernel_17234/287998861.py:1: DtypeWarning: Columns (9,1537,1540,…,2958) have mixed types. Specify dtype option on import or set low_memory=False.
  iped_schools = pd.read_csv('data/IPEDS_schools.csv')
6543
| | school |
|---|---|
| 0 | Alabama A & M University |
| 1 | University of Alabama at Birmingham |
| 2 | Amridge University |
| 3 | University of Alabama in Huntsville |
| 4 | Alabama State University |
We apply the same cleaning function to standardize the school names in the IPEDS dataset:
iped_schools = clean_school_names(iped_schools, 'school')
iped_schools = pd.DataFrame(iped_schools['school'].dropna().unique(), columns=['school'])
# For assessing the match rate, I'll add a 1 to every IPEDS row
iped_schools['match'] = 1
print(len(iped_schools))
iped_schools.head()
6405
| | school | match |
|---|---|---|
| 0 | alabama a & m university | 1 |
| 1 | university of alabama at birmingham | 1 |
| 2 | amridge university | 1 |
| 3 | university of alabama in huntsville | 1 |
| 4 | alabama state university | 1 |
With both datasets cleaned and prepared, we can now attempt to match them based on the school names. After deduplication there are 5,166 unique schools in the Berkeley list and 6,405 in the IPEDS data. Our goal is to maximize the match rate, with the highest possible number of matches being 5,166.
highest_possible_match = len(berkeley_schools)
We perform a left join to match the datasets based on the school column:
merged_1 = berkeley_schools.merge(iped_schools, on= 'school', how = 'left')
merged_1.head()
| | school | match |
|---|---|---|
| 0 | university of kentucky | 1.0 |
| 1 | technionisrael inst of tech | NaN |
| 2 | university of minnesota twin cities | NaN |
| 3 | univ of nairobi | NaN |
| 4 | university of southern california | 1.0 |
print(f"The match rate with standardizing strings in both columns and nothing more is {merged_1['match'].sum() / highest_possible_match * 100:.2f}%")
After standardizing the school names in both datasets, we achieved a match rate of 23.83%. While this is a good start, we can further improve it by employing fuzzy matching, which accounts for minor differences in spelling while a high similarity threshold guards against false matches between distinct schools with similar names.
Applying the Jaro-Winkler Algorithm
To enhance our matching process, we utilize the Jaro-Winkler similarity algorithm. This algorithm is particularly effective for comparing strings and identifying matches based on a similarity score. Here’s how we implement it:
def find_best_match(school, choices, threshold=0.975):
    best_match = None
    highest_score = 0
    for choice in choices:
        score = jellyfish.jaro_winkler(school, choice)
        if score > highest_score:
            best_match = choice
            highest_score = score
    if highest_score >= threshold:
        return best_match, highest_score
    else:
        return None, highest_score
# Apply the matching function to each school name in berkeley_schools
matches = berkeley_schools['school'].apply(lambda x: find_best_match(x, iped_schools['school']))
# Create new columns for the best match and match score
berkeley_schools['Best Match'] = matches.apply(lambda x: x[0])
berkeley_schools['Match Score'] = matches.apply(lambda x: x[1])
berkeley_schools.head()
| | school | Best Match | Match Score |
|---|---|---|---|
| 0 | university of kentucky | university of kentucky | 1.000000 |
| 1 | technionisrael inst of tech | None | 0.828889 |
| 2 | university of minnesota twin cities | university of minnesotatwin cities | 0.982521 |
| 3 | univ of nairobi | None | 0.857179 |
| 4 | university of southern california | university of southern california | 1.000000 |
- Function Definition: The find_best_match function iterates through possible matches and calculates the Jaro-Winkler score for each.
- Threshold Setting: We set a high threshold (0.975) to ensure that only highly similar matches are considered valid.
- Applying the Function: We apply the function to each school name in the Berkeley dataset to find the best match in the IPEDS dataset.
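To get a feel for where the 0.975 threshold sits, you can score a couple of pairs from the tables above directly (an illustrative check; note that newer releases of jellyfish rename this function jaro_winkler_similarity):
# Identical strings score 1.0; near-duplicates score just under it.
print(jellyfish.jaro_winkler('university of kentucky', 'university of kentucky'))  # 1.0
print(jellyfish.jaro_winkler('university of minnesota twin cities',
                             'university of minnesotatwin cities'))  # ~0.9825, just above the 0.975 cutoff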
The results of the fuzzy matching process are promising. Here’s a sample of the output:
filtered_matches = berkeley_schools[berkeley_schools['Match Score'] >= 0.975]
merged_2 = filtered_matches.merge(iped_schools, left_on='Best Match', right_on='school', suffixes=('_left', '_right'), how='left')
merged_2.head()
- Perfect Matches: Some schools, like "University of Kentucky" and "University of Southern California," achieve a perfect match score of 1.0.
- Partial Matches: For schools with lower scores, such as "Technion-Israel Inst of Tech," the algorithm did not find a sufficiently similar match.
| | school_left | Best Match | Match Score | school_right | match |
|---|---|---|---|---|---|
| 0 | university of kentucky | university of kentucky | 1.000000 | university of kentucky | 1 |
| 1 | university of minnesota twin cities | university of minnesotatwin cities | 0.982521 | university of minnesotatwin cities | 1 |
| 2 | university of southern california | university of southern california | 1.000000 | university of southern california | 1 |
| 3 | brown university | brown university | 1.000000 | brown university | 1 |
| 4 | colorado school of mines | colorado school of mines | 1.000000 | colorado school of mines | 1 |
print(f"The match rate with standardizing strings and a jaro-winkler threshold of .975 is {merged_2['match'].sum() / highest_possible_match * 100:.2f}%")
The match rate with standardizing strings and a jaro-winkler threshold of .975 is 26.93%
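The 0.975 cutoff is a judgment call. One quick way to see how sensitive the results are to it is to sweep a few values over the Match Score column we just computed (a sketch of our own, not part of the original script):
for threshold in [0.90, 0.95, 0.975, 0.99]:
    n_candidates = (berkeley_schools['Match Score'] >= threshold).sum()
    print(f"threshold {threshold}: {n_candidates} candidate matches "
          f"({n_candidates / highest_possible_match * 100:.2f}% of the Berkeley list)")
Lower thresholds admit more candidates but also more false positives, such as distinct campuses within the same university system.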
Enhancing Data Matching with AI Assistance
While our initial fuzzy matching using the Jaro-Winkler algorithm improved the match rate to 26.93%, we encountered a challenge: some entries in the Berkeley list are either international schools or not schools at all. To address this, we leverage the power of GPT to identify and filter valid domestic schools.
We use GPT to determine which entries in the Berkeley dataset are valid schools and whether they are located in the United States. This step helps us focus our matching efforts on relevant entries, potentially improving the match rate.
def identify_valid_schools(school_list, user_model):
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    standardized_schools = []
    example_json = """{"country": "United States"}"""
    for school in school_list:
        prompt = f"""Tell me which country the college in triple backticks is located in: ```{school}```. \
Put your response in JSON format with 'country' as the key and your output as the value. \
If it's not a valid school, make the value 'invalid'. \
Here's an example of what the JSON should look like: {example_json}"""
        try:
            response = client.chat.completions.create(
                model=user_model,
                response_format={"type": "json_object"},
                messages=[
                    {"role": "system",
                     "content": "You provide direct and concise responses, answer only the question asked, and return only the requested JSON and nothing more."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0
            )
            standardized_schools.append(response.choices[0].message.content)
            print(f"Processing row {school}")
        except Exception as e:
            print(f"An error occurred: {e}")
            # Append valid JSON so the parsing step below doesn't choke on error rows
            standardized_schools.append('{"country": "error"}')
    data = [json.loads(item) for item in standardized_schools]
    df = pd.DataFrame(data)
    return df['country']
- Functionality: This function queries GPT to identify the country of each school or mark it as invalid if it’s not a recognized institution.
- JSON Response: The response is formatted as JSON, making it easy to parse and use in further analysis.
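The parsing step at the end of the function assumes every response is valid JSON with a 'country' key; for example (values chosen to mirror the results table further below):
responses = ['{"country": "United States"}', '{"country": "Kenya"}', '{"country": "invalid"}']
parsed = pd.DataFrame([json.loads(r) for r in responses])
print(parsed['country'].tolist())  # ['United States', 'Kenya', 'invalid']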
We apply the identify_valid_schools function to the Berkeley dataset and create a new column to indicate whether each school is domestic:
berkeley_schools['country'] = identify_valid_schools(berkeley_schools['school'], GPT_4)
To improve the accuracy of our data matching, we added a domestic indicator to the Berkeley schools dataset. This step helps us focus on schools located in the United States, which are more relevant for our matching purposes.
We created a new column, domestic, to indicate whether each school is located in the United States. This column is populated based on the country information obtained from GPT:
berkeley_schools['domestic'] = berkeley_schools['country'].apply(lambda x: 1 if 'United States' in x else 0)
berkeley_schools.to_csv('data/berkeley_schools_features.csv')
berkeley_schools
- Domestic Indicator: The domestic column is set to 1 for schools in the United States and 0 for others.
- Data Export: The updated dataset is saved for future reference and analysis.
| | school | Best Match | Match Score | country | domestic |
|---|---|---|---|---|---|
| 0 | university of kentucky | university of kentucky | 1.000000 | United States | 1 |
| 1 | technionisrael inst of tech | None | 0.828889 | Israel | 0 |
| 2 | university of minnesota twin cities | university of minnesotatwin cities | 0.982521 | United States | 1 |
| 3 | univ of nairobi | None | 0.857179 | Kenya | 0 |
| 4 | university of southern california | university of southern california | 1.000000 | United States | 1 |
| ... | ... | ... | ... | ... | ... |
| 5161 | pec university of technology (deemed to be uni... | None | 0.736349 | India | 0 |
| 5162 | louisiana college | None | 0.878834 | United States | 1 |
| 5163 | niger | None | 0.711111 | invalid | 0 |
| 5164 | university of nebraska omaha | None | 0.966359 | United States | 1 |
| 5165 | centro universitario catolico salesiano auxilium | None | 0.841793 | Brazil | 0 |

5166 rows × 5 columns
Filtering and Merging for Improved Matches
With the domestic indicator in place, we filter the dataset to focus on high-confidence matches and merge it with the IPEDS dataset:
filtered_matches_domestic = berkeley_schools[berkeley_schools['Match Score'] >= 0.975]
merged_2 = filtered_matches_domestic.merge(iped_schools, left_on='Best Match', right_on='school', suffixes=('_left', '_right'), how='left')
merged_2.head()
| | school_left | Best Match | Match Score | school_right | match |
|---|---|---|---|---|---|
| 0 | university of kentucky | university of kentucky | 1.000000 | university of kentucky | 1 |
| 1 | university of minnesota twin cities | university of minnesotatwin cities | 0.982521 | university of minnesotatwin cities | 1 |
| 2 | university of southern california | university of southern california | 1.000000 | university of southern california | 1 |
| 3 | brown university | brown university | 1.000000 | brown university | 1 |
| 4 | colorado school of mines | colorado school of mines | 1.000000 | colorado school of mines | 1 |
- Filtering Matches: We focus on matches with a Jaro-Winkler score of 0.975 or higher to ensure high confidence.
- Merging Datasets: The filtered matches are merged with the IPEDS dataset to verify and enhance the match rate.
Further Standardizing School Names with LLM
To achieve even greater accuracy, we use a Large Language Model (LLM) to standardize the spelling of school names. This step helps eliminate minor discrepancies that could affect the match rate:
def extract_standard(school_list, user_model):
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    standardized_schools = []
    for school in school_list:
        prompt = f"""Please provide me with the correct and standard way of spelling the following college in triple backticks: ```{school}``` \
Put your response in JSON format with 'school' as the key and your output as the value."""
        try:
            response = client.chat.completions.create(
                model=user_model,
                response_format={"type": "json_object"},
                messages=[
                    {"role": "system",
                     "content": "You provide direct and concise responses and provide only the answer to the question asked. "
                                "You avoid phrases like 'the standard way to spell this school is...' and return only the standard spelling of the school."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0
            )
            standardized_schools.append(response.choices[0].message.content)
            print(f"Processing row {school}")
        except Exception as e:
            print(f"An error occurred: {e}")
            # Append valid JSON so the parsing step below doesn't choke on error rows
            standardized_schools.append('{"school": "error"}')
    data = [json.loads(item) for item in standardized_schools]
    df = pd.DataFrame(data)
    return df['school']
- LLM Standardization: The function queries GPT to provide the standard spelling for each school name, formatted as JSON.
- Error Handling: The function includes error handling to manage any issues during the API call.
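To make the function's intent concrete, here is the kind of transformation it should produce, using two names from the Berkeley list (expected outputs taken from the results table in the next section; actual model output can vary from run to run):
sample = extract_standard(['univ of nairobi', 'technionisrael inst of tech'], GPT_3)
print(sample.tolist())
# Expected, per the table below:
# ['University of Nairobi', 'Technion Israel Institute of Technology']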
Leveraging GPT-3 for Enhanced Standardization
To further refine our data matching process, we utilize GPT-3 to standardize the spelling of school names across both datasets. This step aims to reduce discrepancies caused by variations in naming conventions and improve the overall match rate. We use GPT-3 here because of its lower cost and the relative simplicity of the task.
We apply the extract_standard function to both the Berkeley and IPEDS datasets using GPT-3. This function queries GPT-3 to provide a consistent, standardized spelling for each school name:
berkeley_schools['gpt3_standardized'] = extract_standard(berkeley_schools['school'], GPT_3)
berkeley_schools.head()
iped_schools['gpt3_standardized'] = extract_standard(iped_schools['school'], GPT_3)
With standardized names in place, we perform another round of matching using the Jaro-Winkler algorithm, this time on the GPT-3 standardized columns.
The table below shows a sample of the results after applying GPT-3 standardization and matching:
matches_gpt = berkeley_schools['gpt3_standardized'].apply(lambda x: find_best_match(x, iped_schools['gpt3_standardized']))
berkeley_schools['Best Match GPT'] = matches_gpt.apply(lambda x: x[0])
berkeley_schools['Match Score GPT'] = matches_gpt.apply(lambda x: x[1])
berkeley_schools.head()
| | school | Best Match | Match Score | country | domestic | gpt3_standardized | Best Match GPT | Match Score GPT |
|---|---|---|---|---|---|---|---|---|
| 0 | university of kentucky | university of kentucky | 1.000000 | United States | 1 | University of Kentucky | University of Kentucky | 1.000000 |
| 1 | technionisrael inst of tech | None | 0.828889 | Israel | 0 | Technion Israel Institute of Technology | None | 0.806527 |
| 2 | university of minnesota twin cities | university of minnesotatwin cities | 0.982521 | United States | 1 | University of Minnesota Twin Cities | University of Minnesota Twin Cities | 1.000000 |
| 3 | univ of nairobi | None | 0.857179 | Kenya | 0 | University of Nairobi | None | 0.931746 |
| 4 | university of southern california | university of southern california | 1.000000 | United States | 1 | University of Southern California | University of Southern California | 1.000000 |
Evaluating the True Match Rate
After leveraging GPT-3 for standardization and applying the Jaro-Winkler algorithm, we achieved a match rate of 31.32%. However, this calculation used the total number of schools in the Berkeley dataset as the denominator, which includes international entries. To obtain a more accurate match rate, we need to focus on domestic schools only.
We previously asked GPT to identify whether each school in the Berkeley dataset is located in the United States. Now, we use this information to refine our denominator and calculate the true match rate:
filtered_matches_gpt = berkeley_schools[berkeley_schools['Match Score GPT'] >= 0.975]
merged_3 = filtered_matches_gpt.merge(iped_schools, left_on='Best Match GPT', right_on='gpt3_standardized', suffixes=('_left', '_right'), how='left')
merged_3.head()
- Domestic Focus: By using only domestic schools as the denominator, we ensure that our match rate reflects the relevant subset of data.
- Accurate Denominator: GPT identified 2,409 domestic schools, which serves as the new denominator for calculating the match rate.
| | school_left | Best Match | Match Score | country | domestic | gpt3_standardized_left | Best Match GPT | Match Score GPT | school_right | match | gpt3_standardized_right |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | university of kentucky | university of kentucky | 1.000000 | United States | 1 | University of Kentucky | University of Kentucky | 1.0 | university of kentucky | 1 | University of Kentucky |
| 1 | university of minnesota twin cities | university of minnesotatwin cities | 0.982521 | United States | 1 | University of Minnesota Twin Cities | University of Minnesota Twin Cities | 1.0 | university of minnesotatwin cities | 1 | University of Minnesota Twin Cities |
| 2 | university of southern california | university of southern california | 1.000000 | United States | 1 | University of Southern California | University of Southern California | 1.0 | university of southern california | 1 | University of Southern California |
| 3 | brown university | brown university | 1.000000 | United States | 1 | Brown University | Brown University | 1.0 | brown university | 1 | Brown University |
| 4 | colorado school of mines | colorado school of mines | 1.000000 | United States | 1 | Colorado School of Mines | Colorado School of Mines | 1.0 | colorado school of mines | 1 | Colorado School of Mines |
Final Match Rate Calculation
With GPT-3 standardization in place, we first recompute the match rate, still using the full Berkeley list as the denominator:
print(f"The match rate with standardizing strings and a jaro-winkler threshold of .975 is {merged_3['match'].sum() / highest_possible_match * 100:.2f}%")
The match rate with standardizing strings and a jaro-winkler threshold of .975 is 31.32%
- Denominator Caveat: This 31.32% still uses the full Berkeley list as the denominator, which includes international schools that cannot appear in IPEDS; the true domestic match rate is calculated below.
berkeley_schools.to_csv('data/berkeley_schools_features.csv', index = False)
iped_schools.to_csv('data/IPEDS_schools_features.csv', index = False)
The match rate is still not very high. But the right-side data (IPEDS) only covers schools in the US, so using the total school count from the left-side data (Berkeley), which includes many international entries, as the denominator understates our true performance. Recall that earlier we asked GPT to identify whether each school is located in the US. Let's take only those rows where the school is domestic, use that count as the denominator, and see what the true match rate is.
true_highest_possible_match = berkeley_schools['domestic'].sum()
print(f"GPT identified {true_highest_possible_match} domestic schools. This should be our denominator.")
GPT identified 2409 domestic schools. This should be our denominator.
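Working backwards from the rate reported below, this denominator implies roughly 0.6716 × 2,409 ≈ 1,618 matched domestic schools.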
Below we find that we now achieve a 67.16% match rate, thanks to GPT standardization combined with Jaro-Winkler matching:
print(f"The match rate with standardizing strings of domestic schools and a jaro-winkler threshold of .975 is {merged_3['match'].sum() / true_highest_possible_match * 100:.2f}%")
The match rate with standardizing strings of domestic schools and a jaro-winkler threshold of .975 is 67.16%