fffrost
fffrost

Reputation: 1767

identify common column values across two different sized dataframes in pandas

I have two dataframes of different row and column sizes. I want to compare the two and create new columns in df2 based on whether values exist in df1. First for an example (I think you can copy/paste this text into a .csv to import), df1 looks like this:

subject block   target  dist1   dist2   dist3
7   1   doorlock    candleholder01  jar03   stroller
7   2   glassescase clownfish   kangaroo    ram
7   3   badger  chocolatefonduedish hosenozzle  toycar04
7   4   hyena   crocodile   pig toad
7   1   scooter cormorant   lizard  rockbass

df2 like this:

subject image
7   acorn
7   chainsaw
7   doorlock
7   stroller
7   bathtub
7   clownfish
7   bagtie
7   birdie
7   witchhat
7   crocodile
7   honeybee
7   electricitymeter
7   flowerwreath
7   jar03
7   camera02a

and what I'd like to achieve is this:

subject image   present type    block
7   acorn   0   NA  NA
7   chainsaw    0   NA  NA
7   doorlock    1   target  1
7   stroller    1   dist3   1
7   bathtub 0   NA  NA
7   clownfish   1   dist1   2
7   bagtie  0   NA  NA
7   birdie  0   NA  NA
7   witchhat    0   NA  NA
7   crocodile   1   dist1   4
7   honeybee    0   NA  NA
7   electricitymeter    0   NA  NA
7   flowerwreath    0   NA  NA
7   jar03   1   dist2   1
7   camera02a   0   NA  NA

Specifically, I would like to identify, from the 4 columns in df1 ('target', 'dist1', 'dist2', 'dist3'), which values exist in the 'image' column of df2, and then (1) generate a column (boolean or 0/1) in df2 indicating whether that value exists in df1, (2) generate a second column in df2 with the name of the column in which that item exists in df1 (i.e. 'target', 'dist1', ...), and finally (3) generate a column in df2 with the df1 'block' value from which that item came from, if any.

I hope this is clear. I'd also like some ideas on how to handle the cases that don't match - should I code these as NAN or just empty strings? The thing is I will probably be groupby()'ing later, and I had some problems with groupby() when the df contained missing values..

Upvotes: 0

Views: 52

Answers (1)

Ben
Ben

Reputation: 836

You can do this by using melt on df1 and merge.

df1 = df1.melt(id_vars=['subject', 'block'], var_name='type', value_name='image')
df2['present'] = df2['image'].isin(df1['image']).astype(int)
pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')

    subject             image   present     type    block
0         7             acorn         0      NaN    NaN
1         7          chainsaw         0      NaN    NaN
2         7          doorlock         1   target    1.0
3         7          stroller         1    dist3    1.0
4         7           bathtub         0      NaN    NaN
5         7         clownfish         1    dist1    2.0
6         7            bagtie         0      NaN    NaN
7         7            birdie         0      NaN    NaN
8         7          witchhat         0      NaN    NaN
9         7         crocodile         1    dist1    4.0
10        7          honeybee         0      NaN    NaN
11        7  electricitymeter         0      NaN    NaN
12        7      flowerwreath         0      NaN    NaN
13        7             jar03         1    dist2    1.0
14        7         camera02a         0      NaN    NaN

As for the missing values, I would keep them as NaN. pandas is pretty powerful in terms of working with missing data, so may as well take advantage of this.

Upvotes: 2

Related Questions