identify common column values across two different sized dataframes in pandas

Question

I have two dataframes of different row and column sizes. I want to compare the two and create new columns in df2 based on whether values exist in df1. First for an example (I think you can copy/paste this text into a .csv to import), df1 looks like this:

subject block   target  dist1   dist2   dist3
7   1   doorlock    candleholder01  jar03   stroller
7   2   glassescase clownfish   kangaroo    ram
7   3   badger  chocolatefonduedish hosenozzle  toycar04
7   4   hyena   crocodile   pig toad
7   1   scooter cormorant   lizard  rockbass

df2 like this:

subject image
7   acorn
7   chainsaw
7   doorlock
7   stroller
7   bathtub
7   clownfish
7   bagtie
7   birdie
7   witchhat
7   crocodile
7   honeybee
7   electricitymeter
7   flowerwreath
7   jar03
7   camera02a

and what I'd like to achieve is this:

subject image   present type    block
7   acorn   0   NA  NA
7   chainsaw    0   NA  NA
7   doorlock    1   target  1
7   stroller    1   dist3   1
7   bathtub 0   NA  NA
7   clownfish   1   dist1   2
7   bagtie  0   NA  NA
7   birdie  0   NA  NA
7   witchhat    0   NA  NA
7   crocodile   1   dist1   4
7   honeybee    0   NA  NA
7   electricitymeter    0   NA  NA
7   flowerwreath    0   NA  NA
7   jar03   1   dist2   1
7   camera02a   0   NA  NA

Specifically, I would like to identify, from the 4 columns in df1 ('target', 'dist1', 'dist2', 'dist3'), which values exist in the 'image' column of df2, and then (1) generate a column (boolean or 0/1) in df2 indicating whether that value exists in df1, (2) generate a second column in df2 with the name of the column in which that item exists in df1 (i.e. 'target', 'dist1', ...), and finally (3) generate a column in df2 with the df1 'block' value from which that item came from, if any.

I hope this is clear. I'd also like some ideas on how to handle the cases that don't match - should I code these as NAN or just empty strings? The thing is I will probably be groupby()'ing later, and I had some problems with groupby() when the df contained missing values..

Ben · Accepted Answer

You can do this by using melt on df1 and merge.

df1 = df1.melt(id_vars=['subject', 'block'], var_name='type', value_name='image')
df2['present'] = df2['image'].isin(df1['image']).astype(int)
pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')

    subject             image   present     type    block
0         7             acorn         0      NaN    NaN
1         7          chainsaw         0      NaN    NaN
2         7          doorlock         1   target    1.0
3         7          stroller         1    dist3    1.0
4         7           bathtub         0      NaN    NaN
5         7         clownfish         1    dist1    2.0
6         7            bagtie         0      NaN    NaN
7         7            birdie         0      NaN    NaN
8         7          witchhat         0      NaN    NaN
9         7         crocodile         1    dist1    4.0
10        7          honeybee         0      NaN    NaN
11        7  electricitymeter         0      NaN    NaN
12        7      flowerwreath         0      NaN    NaN
13        7             jar03         1    dist2    1.0
14        7         camera02a         0      NaN    NaN

As for the missing values, I would keep them as NaN. pandas is pretty powerful in terms of working with missing data, so may as well take advantage of this.

identify common column values across two different sized dataframes in pandas

Answers (1)

Related Questions