multigoodverse

Reputation: 8072

Concatenate dataframe rows and match when key is the same

I have two dataframes, df1 and df2, and I am trying to figure out a way to generate df3 as you see in the screenshot:

[screenshot of df1, df2 and the desired df3]

So, the goal here is to retain all the rows of df1 and append the rows of df2 under them. However, I want a single row wherever Name, Lat and Lon match, so Name, Lat and Lon will be used as the keys.

There's also the issue of the ZIP column. I want to retain the ZIP value of df1 for rows that are joined.

I tried:

df3=pandas.merge(df1,df2,on=['Name','Lat','Lon'],how='outer')

That produced something close to what I want, but the resulting dataframe came up with two separate ZIP columns and two separate Address columns, one from each input dataframe.

Any idea on how to get the clean df3 dataframe?

Upvotes: 2

Views: 3398

Answers (1)

Sam

Reputation: 4090

I don't think merge is appropriate for this task (i.e., joining the left DF on the right DF), since you are really putting one DF on top of the other and then dropping the duplicates. So you could try something like:

#put one DF 'on top' of the other (like-named columns should drop into place)
df3 = pandas.concat([df1, df2])
#get rid of any duplicates
df3.drop_duplicates(inplace = True)
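As written, drop_duplicates() only removes rows that are identical in every column, so rows that share the keys but differ in ZIP would both survive. A minimal sketch of a variation that deduplicates on the key columns only is below. The sample values are hypothetical (the real data was only shown in screenshots), and the approach assumes the key values match exactly, which can be fragile for floating-point Lat/Lon.

```python
import pandas as pd

keys = ["Name", "Lat", "Lon"]

# Hypothetical sample data standing in for the frames in the question.
df1 = pd.DataFrame({
    "Name": ["A", "B"],
    "Lat": [1.0, 2.0],
    "Lon": [10.0, 20.0],
    "ZIP": ["11111", "22222"],
})
df2 = pd.DataFrame({
    "Name": ["B", "C"],
    "Lat": [2.0, 3.0],
    "Lon": [20.0, 30.0],
    "ZIP": ["99999", "33333"],
})

# Stack df2 under df1, then keep only the first row seen for each key.
# Because df1 comes first, its row (and its ZIP) wins on a match.
df3 = (
    pd.concat([df1, df2], ignore_index=True)
      .drop_duplicates(subset=keys, keep="first")
      .reset_index(drop=True)
)
```

Note that this keeps df1's whole row for a matching key, so any value that exists only in df2's matching row is discarded.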

EDIT

Given your feedback, I realize a slightly dirtier solution is needed. You would use a merge and then fill the NaNs in the original columns from the duplicate columns. Something like:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'test':[1,2,3,6,np.nan, np.nan]})
df2 = pd.DataFrame({'test':[np.nan,np.nan,3,6,10,24]})

#some merge statement to get them together into the var 'df'
#(overlapping column names get the default '_x' and '_y' suffixes)
df = pd.merge(df1, df2, left_index = True, right_index=True)

#collect the _x columns
original_cols = [x for x in df.columns if x.endswith('_x')]

for col in original_cols:
    #use the duplicate column to fill the NaN's of the original column
    duplicate = col.replace('_x', '_y')
    #assign the result back: an inplace fillna on df[col] may act on a copy
    df[col] = df[col].fillna(df[duplicate])

    #drop the duplicate
    df.drop(duplicate, axis = 1, inplace = True)

    #rename the original to remove the '_x'
    df.rename(columns = {col:col.replace('_x', '')}, inplace = True)
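Applied to the question's keys, the same merge-then-fill idea might look like the sketch below. The sample rows and the column set (ZIP, Address) are assumptions based on the question's description, not the real data. Because the left-hand ('_x') value is kept whenever it is present, df1's ZIP wins on matched rows and df2's values only fill the gaps.

```python
import pandas as pd

keys = ["Name", "Lat", "Lon"]

# Hypothetical sample data; the real frames were only shown as screenshots.
df1 = pd.DataFrame({
    "Name": ["A", "B"],
    "Lat": [1.0, 2.0],
    "Lon": [10.0, 20.0],
    "ZIP": ["11111", "22222"],
    "Address": ["1 First St", None],
})
df2 = pd.DataFrame({
    "Name": ["B", "C"],
    "Lat": [2.0, 3.0],
    "Lon": [20.0, 30.0],
    "ZIP": ["99999", "33333"],
    "Address": ["2 Second St", "3 Third St"],
})

# Outer merge on the keys; overlapping columns get '_x' (df1) and '_y' (df2).
merged = pd.merge(df1, df2, on=keys, how="outer", suffixes=("_x", "_y"))

for col in [c for c in merged.columns if c.endswith("_x")]:
    base = col[:-2]          # strip the trailing '_x'
    dup = base + "_y"
    # Prefer df1's value; fall back to df2's where df1 has none.
    merged[base] = merged[col].fillna(merged[dup])
    merged = merged.drop(columns=[col, dup])

df3 = merged
```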

Let me know if this works.

Upvotes: 2
