user3813620
user3813620

Reputation: 359

How can I merge a Pandas dataframes based on a substring from one of the columns?

I have 2 dataframes: df1 and df2

df1
                  School Conference
0              Air Force   Mt. West
1                  Akron        MAC
2  Alabama at Birmingham      C-USA
3                 Auburn   Sun Belt

df2
                           SCHOOL_NAME           RATE
0                    Auburn University           93.0
1                    Air Force Academy           53.0
2                           Birmingham           75.0
3                  University of Akron           77.0

I would like to get the output below, basically binding the `RATE` column from df2 into df1 based on substring from School column
                  School Conference  RATE
0              Air Force   Mt. West  53.0
1                  Akron        MAC  77.0
2  Alabama at Birmingham      C-USA  75.0
3                 Auburn   Sun Belt  93.0

I tried the code below but it's not working. When I run it, it seems to execute successfully, but nothing happens

for i in range(1, len(df1)):
    if df1['School'][i] in df2['SCHOOL_NAME']:
       pd.merge(df1, df2, how = 'left', left_on = 'School', right_on = 'SCHOOL_NAME')

Upvotes: 4

Views: 91

Answers (1)

David Erickson
David Erickson

Reputation: 16683

You can use list comprehension to check if the columns from each dataframe are in each other (you also compare case-insensitively) and then merge:

df1['SCHOOL_NAME'] = df1['School'].apply(lambda x: [y for y in df2['SCHOOL_NAME']
                                                    if x in y or y in x]).str[0]
df1 = df1.merge(df2, how='left').drop('SCHOOL_NAME', axis=1) #can pass on='SCHOOL_NAME' to merge.
df1
Out[1]: 
                  School Conference  RATE
0              Air Force   Mt. West  53.0
1                  Akron        MAC  77.0
2  Alabama at Birmingham      C-USA  75.0
3                 Auburn   Sun Belt  93.0

You could also search case-insensitively by adding .lower() to x and y:

df1['SCHOOL_NAME'] = df1['School'].apply(lambda x: [y for y in df2['SCHOOL_NAME']
                                                    if x.lower() in y.lower()
                                                    or y.lower() in x.lower()]).str[0]
df1 = df1.merge(df2, how='left').drop('SCHOOL_NAME', axis=1) #can pass on='SCHOOL_NAME' to merge.
df1
Out[2]:
                  School Conference  RATE
0              Air Force   Mt. West  53.0
1                  Akron        MAC  77.0
2  Alabama at Birmingham      C-USA  75.0
3                 Auburn   Sun Belt  93.0

Single line of code per comment:

df1 = (df1.assign(SCHOOL_NAME = df1['School'].apply(lambda x: [y for y in df2['SCHOOL_NAME']
                                                    if x.lower() in y.lower()
                                                    or y.lower() in x.lower()]).str[0])
          .merge(df2, how='left').drop('SCHOOL_NAME', axis=1))
df1
Out[3]: 
                  School Conference  RATE
0              Air Force   Mt. West  53.0
1                  Akron        MAC  77.0
2  Alabama at Birmingham      C-USA  75.0
3                 Auburn   Sun Belt  93.0

Upvotes: 4

Related Questions