aero8991

Reputation: 333

Join two pandas dataframes and match data to one column

I have the following two data frames, both read from Excel files:

df_a = 10,000 rows (the master list that has all the unique supplier #s)

df_b = 670 rows

I am loading an Excel file (df_b) that has ZIP, address, and state, and I want to match on that info and then add the supplier # from df_a, so that I end up with one file that is still 670 rows but now also has the supplier # column.

df_a =
(10,000 rows; supplier # is unique)

   supplier #    ZIP  ADDRESS       STATE  Unique Key
0     7100000  35481  14th street   CA     35481-14th street-CA
1     7000005  45481  14th street   CA     45481-14th street-CA
2     7000006  45482  140th circle  CT     45482-140th circle-CT
3     7000007  35482  140th circle  CT     35482-140th circle-CT
4     7000008  35483  13th road     VT     35483-13th road-VT

df_b =
(670 rows)

     ZIP  ADDRESS       STATE  Unique Key
0  35481  14th street   CA     35481-14th street-CA
1  45481  14th street   CA     45481-14th street-CA
2  45482  140th circle  CT     45482-140th circle-CT
3  35482  140th circle  CT     35482-140th circle-CT
4  35483  13th road     VT     35483-13th road-VT

OUTPUT:
df_c =
(670 rows)

     ZIP  ADDRESS       STATE  Unique Key              supplier # (unique)
0  35481  14th street   CA     35481-14th street-CA    7100000
1  45481  14th street   CA     45481-14th street-CA    7000005
2  45482  140th circle  CT     45482-140th circle-CT   7000006
3  35482  140th circle  CT     35482-140th circle-CT   7000007
4  35483  13th road     VT     35483-13th road-VT      7000008

I tried merging the two dfs together, but they are not matching and instead I'm getting a bunch of NaN:

df10 = df_a.merge(df_b, on='Unique Key', how='left')

The result is one data frame with lots of columns and no matches. I've also tried .map and .concat. I'm not sure what's going on.
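(Editor's note: when a left merge returns all NaN, the keys usually don't match exactly. A minimal sketch, with made-up data, showing how merge's indicator flag can reveal this — e.g. a trailing space in one key:)

```python
import pandas as pd

# Hypothetical keys: one matches, one differs by a trailing space.
df_a = pd.DataFrame({"Unique Key": ["A-1", "B-2 "], "supplier #": [1, 2]})
df_b = pd.DataFrame({"Unique Key": ["A-1", "B-2"]})

# indicator=True adds a _merge column saying where each row matched.
check = df_b.merge(df_a, on="Unique Key", how="left", indicator=True)
print(check["_merge"].tolist())  # ['both', 'left_only']
```

Rows flagged 'left_only' are the ones whose key found no partner in df_a.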

Upvotes: 0

Views: 132

Answers (1)

CreekGeek
CreekGeek

Reputation: 2329

Have you tried

df10 = df_a.merge(df_b, on='Unique Key', how='inner')

An 'inner' join retains only the records common to both frames, which, IIUC, is what you're trying to achieve.
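A minimal sketch of that behavior, using made-up keys and supplier numbers:

```python
import pandas as pd

# Stand-in frames: df_a is the master list, df_b holds a subset of keys.
df_a = pd.DataFrame({"Unique Key": ["K1", "K2", "K3"],
                     "supplier #": [7100000, 7000005, 7000006]})
df_b = pd.DataFrame({"Unique Key": ["K2", "K3", "K4"]})

# An inner merge keeps only rows whose key appears in BOTH frames.
df10 = df_a.merge(df_b, on="Unique Key", how="inner")
print(df10["Unique Key"].tolist())  # ['K2', 'K3']
```

K1 (only in df_a) and K4 (only in df_b) are dropped; the matched rows carry the supplier # over.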

ADDED 2021-02-14

Creating CSVs from your test data and reading them into pandas:

(screenshots of df_a and df_b as read in)

df_mrg = df_a.merge(df_b[1:3], how='inner', on='Unique_Key')
df_mrg

produces: (screenshot of the merged result)

Notes:

  • the slice on df_b ([1:3]) creates a subset for illustration
  • I changed the column names (spaces and symbols other than _ make my skin crawl)
  • I also manually removed leading and trailing whitespace from the values in Unique_Key (pandas string methods such as .str.strip() can automate this)
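That whitespace cleanup can be automated with pandas' vectorized string methods; a short sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"Unique_Key": ["  35481-14th street-CA ",
                                  "45481-14th street-CA"]})

# .str.strip() removes leading/trailing whitespace from every value,
# so keys built from padded cells will match their clean counterparts.
df["Unique_Key"] = df["Unique_Key"].str.strip()
print(df["Unique_Key"].tolist())
```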

Also consider that:

df_mrg = df_a.merge(df_b[1:3], how='right', on='Unique_Key')

will return the same dataframe as 'inner' for the present data, but could be something worth testing depending on your data and what you want to know.

Also, merge permits passing a list of columns. Since the source columns for your compound key are present in both tables, you could test for potential problems with the compound key by:

df_mrg2 = df_a.merge(df_b[1:3], how='inner', on=['ZIP','ADDRESS','STATE'])
np.where(df_mrg2['Unique_Key_x']==df_mrg2['Unique_Key_y'],True,False)

df_mrg2 returns the same record set as df_mrg, but without duplicating the 'on' fields.
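A runnable sketch of that compound-key check, with toy data (column names assumed to match the renamed ones above):

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"ZIP": [35481], "ADDRESS": ["14th street"],
                     "STATE": ["CA"],
                     "Unique_Key": ["35481-14th street-CA"],
                     "supplier": [7100000]})
df_b = pd.DataFrame({"ZIP": [35481], "ADDRESS": ["14th street"],
                     "STATE": ["CA"],
                     "Unique_Key": ["35481-14th street-CA"]})

# Merging on the source columns keeps both Unique_Key copies
# (suffixed _x/_y), so they can be compared to catch rows where
# the compound key was assembled inconsistently.
df_mrg2 = df_a.merge(df_b, how="inner", on=["ZIP", "ADDRESS", "STATE"])
match = np.where(df_mrg2["Unique_Key_x"] == df_mrg2["Unique_Key_y"],
                 True, False)
print(match)
```

Any False in the result flags a row whose key was built differently in the two files.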

All this goes way beyond answering your question, but I hope it helps.

Upvotes: 1
