Reputation: 333
I have the following two data frames, taken from Excel files:
df_a = 10000 rows (the master list that has all unique supplier #s)
df_b = 670 rows
I am loading an Excel file (df_b) that has ZIP, address, and state. I want to match that info against df_a and add the supplier # from df_a, so that I end up with one file that is still 670 rows but now also has the supplier # column.
df_a =
(10000 rows)
(unique)
supplier # ZIP ADDRESS STATE Unique Key
0 7100000 35481 14th street CA 35481-14th street-CA
1 7000005 45481 14th street CA 45481-14th street-CA
2 7000006 45482 140th circle CT 45482-140th circle-CT
3 7000007 35482 140th circle CT 35482-140th circle-CT
4 7000008 35483 13th road VT 35483-13th road-VT
df_b =
(670 rows)
ZIP ADDRESS STATE Unique Key
0 35481 14th street CA 35481-14th street-CA
1 45481 14th street CA 45481-14th street-CA
2 45482 140th circle CT 45482-140th circle-CT
3 35482 140th circle CT 35482-140th circle-CT
4 35483 13th road VT 35483-13th road-VT
OUTPUT:
df_c =
(670 rows)
ZIP ADDRESS STATE Unique Key (Unique)supplier #
0 35481 14th street CA 35481-14th street-CA 7100000
1 45481 14th street CA 45481-14th street-CA 7000005
2 45482 140th circle CT 45482-140th circle-CT 7000006
3 35482 140th circle CT 35482-140th circle-CT 7000007
4 35483 13th road VT 35483-13th road-VT 7000008
I tried merging the two dfs together, but they are not matching and instead I'm getting a bunch of NaN values:
df10 = df_a.merge(df_b, on='Unique Key', how='left')
The result is one data frame with lots of columns and no matches. I've also tried .map and .concat. I'm not sure what's going on.
Upvotes: 0
Views: 132
Reputation: 2329
Have you tried
df10 = df_a.merge(df_b, on='Unique Key', how='inner')
An inner join retains only the records common to both frames, which, IIUC, is what you're trying to achieve.
Creating CSVs from your test data and reading them into pandas:
df_mrg = df_a.merge(df_b[1:3], how='inner', on='Unique Key')
df_mrg
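To keep all 670 rows of df_b and just pull the supplier # across, it also helps to merge with df_b as the left frame. A minimal runnable sketch using a few of the sample rows above (column names assumed to match the Excel headers exactly, including the space in 'Unique Key'):

```python
import pandas as pd

# Small stand-ins for the frames in the question.
df_a = pd.DataFrame({
    'supplier #': [7100000, 7000005, 7000006],
    'ZIP': ['35481', '45481', '45482'],
    'ADDRESS': ['14th street', '14th street', '140th circle'],
    'STATE': ['CA', 'CA', 'CT'],
    'Unique Key': ['35481-14th street-CA', '45481-14th street-CA',
                   '45482-140th circle-CT'],
})
df_b = df_a[['ZIP', 'ADDRESS', 'STATE', 'Unique Key']].copy()

# df_b on the left keeps exactly df_b's rows (670 in the real data)
# and appends the supplier # wherever the key matches.
df_c = df_b.merge(df_a[['Unique Key', 'supplier #']],
                  on='Unique Key', how='left')
```

Selecting only the two needed columns from df_a avoids the duplicated `_x`/`_y` columns the asker saw.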
Notes:
Also consider that:
df_mrg = df_a.merge(df_b[1:3], how='right', on='Unique Key')
will return the same DataFrame as 'inner' for the present data, but it could be worth testing depending on your data and what you want to know.
Also, merge accepts a list of columns. Since the source columns for your compound key are in both tables, you could test for potential problems with the compound key by:
df_mrg2 = df_a.merge(df_b[1:3], how='inner', on=['ZIP', 'ADDRESS', 'STATE'])
(df_mrg2['Unique Key_x'] == df_mrg2['Unique Key_y']).all()
df_mrg2 returns the same record set as df_mrg, but without duplicating the 'on' fields.
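If the original left merge really returns all NaN, the keys probably differ in a way that is easy to miss, e.g. stray whitespace from the Excel export (a hypothetical cause, not confirmed from the question). A sketch using merge's indicator=True to diagnose this:

```python
import pandas as pd

# Hypothetical reproduction of the all-NaN symptom: keys that look
# identical but differ by a trailing space.
df_a = pd.DataFrame({'Unique Key': ['35481-14th street-CA '],  # trailing space
                     'supplier #': [7100000]})
df_b = pd.DataFrame({'Unique Key': ['35481-14th street-CA']})

# indicator=True adds a '_merge' column labelling each row 'both',
# 'left_only', or 'right_only' -- an immediate view of match quality.
check = df_b.merge(df_a, on='Unique Key', how='left', indicator=True)
# Here every row is 'left_only', i.e. no matches.

# Normalising the keys fixes the join.
df_a['Unique Key'] = df_a['Unique Key'].str.strip()
fixed = df_b.merge(df_a, on='Unique Key', how='left', indicator=True)
# Now every row is 'both'.
```

The same `str.strip()` (and perhaps `str.lower()`) treatment on both frames' key columns is a cheap first check before digging further.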
All this goes well beyond answering your question, but I hope it helps.
Upvotes: 1