MarkInSF
MarkInSF

Reputation: 21

Merging two CSV files into a Python data frame when the merge key string is not identical

I have two CSV files from different sources with data from each county in the United States. I can merge the files into a single data frame with the common key for county name, but the two files use a different string format. For example, one file -- "df1" -- has "San Francisco" while the other file -- "df2" -- has "San Francisco County." I need to compare the two columns, identify the matches, and then create an identical key for the merge. For example, if df1 contains "San Francisco," I need to check all of the rows in df2, identify that "San Francisco County" contains "San Francisco," and then create a new key column in df2 that has just "San Francisco."

Here is what I tried.

I would merge the data frames with an inner join. The two CSV files do not have the same number of rows. I believe that the operation, below, will return only rows that have a match, which is what I want.

merged_df = pd.merge(left=df1, right=df2, left_on="COUNTY", right_on="COUNTY")

I am having trouble creating the common key. Here is what I have:

# Create an empty container to store the data
COUNTY_KEY = []

for row in df2['COUNTY']:
    if df1['COUNTY'] in df2['COUNTY']:
        COUNTY_KEY.APPEND(df1(['COUNTY']))
    else:
        COUNTY_KEY.append('0')

# Create the key by adding the new data to df2
df2['COUNTY_KEY'] = COUNTY_KEY

I believe the problem is that I'm trying to compare two objects -- the county variables in the two data frames -- but I need to get inside the objects and compare them at the string level.

I have read a lot of questions and answers about merging data frames, but I haven't found anyone ask this question -- how can we merge two data frames when a common key exists but is not an exact match?

Thank you for your help!

Upvotes: 1

Views: 358

Answers (1)

gold_cy
gold_cy

Reputation: 14226

This should work,

df = pd.DataFrame({'id':[1,2,3,4,5],'county':['San Francisco County','San Jose County','Miami','Syracuse','Buffalo']})

df1 = pd.DataFrame({'id':[9,9,8,7,6],'county':['San Francisco','San Jos
e','Miami', 'Syracuse','Buffalo']})

def matched(target, val):
    target_vals = target['county'].values.tolist()
    for county in target_vals:
        if val in county:
            return county

df1['county_key'] = df1['county'].apply(lambda x: matched(df, x))

pd.merge(df, df1, how='inner', left_on='county',right_on='county_key')

           county_x       id_x       county_y  id_y            county_key
0  San Francisco County     1  San Francisco     9  San Francisco County
1       San Jose County     2       San Jose     9       San Jose County
2                 Miami     3          Miami     8                 Miami
3              Syracuse     4       Syracuse     7              Syracuse
4               Buffalo     5        Buffalo     6               Buffalo

I didn't format the final data frame but you should understand now how to accomplish your task.

Upvotes: 1

Related Questions