Reputation: 1767
I have two dataframes of different row and column sizes. I want to compare the two and create new columns in df2
based on whether values exist in df1
. First for an example (I think you can copy/paste this text into a .csv to import), df1
looks like this:
subject block target dist1 dist2 dist3
7 1 doorlock candleholder01 jar03 stroller
7 2 glassescase clownfish kangaroo ram
7 3 badger chocolatefonduedish hosenozzle toycar04
7 4 hyena crocodile pig toad
7 1 scooter cormorant lizard rockbass
df2
like this:
subject image
7 acorn
7 chainsaw
7 doorlock
7 stroller
7 bathtub
7 clownfish
7 bagtie
7 birdie
7 witchhat
7 crocodile
7 honeybee
7 electricitymeter
7 flowerwreath
7 jar03
7 camera02a
and what I'd like to achieve is this:
subject image present type block
7 acorn 0 NA NA
7 chainsaw 0 NA NA
7 doorlock 1 target 1
7 stroller 1 dist3 1
7 bathtub 0 NA NA
7 clownfish 1 dist1 2
7 bagtie 0 NA NA
7 birdie 0 NA NA
7 witchhat 0 NA NA
7 crocodile 1 dist1 4
7 honeybee 0 NA NA
7 electricitymeter 0 NA NA
7 flowerwreath 0 NA NA
7 jar03 1 dist2 1
7 camera02a 0 NA NA
Specifically, I would like to identify, from the 4 columns in df1 ('target', 'dist1', 'dist2', 'dist3'
), which values exist in the 'image'
column of df2
, and then (1) generate a column (boolean or 0/1) in df2
indicating whether that value exists in df1
, (2) generate a second column in df2
with the name of the column in which that item exists in df1
(i.e. 'target', 'dist1'
, ...), and finally (3) generate a column in df2
with the df1
'block' value from which that item came from, if any.
I hope this is clear. I'd also like some ideas on how to handle the cases that don't match - should I code these as NAN
or just empty strings? The thing is I will probably be groupby()
'ing later, and I had some problems with groupby()
when the df contained missing values..
Upvotes: 0
Views: 52
Reputation: 836
You can do this by using melt
on df1
and merge
.
df1 = df1.melt(id_vars=['subject', 'block'], var_name='type', value_name='image')
df2['present'] = df2['image'].isin(df1['image']).astype(int)
pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')
subject image present type block
0 7 acorn 0 NaN NaN
1 7 chainsaw 0 NaN NaN
2 7 doorlock 1 target 1.0
3 7 stroller 1 dist3 1.0
4 7 bathtub 0 NaN NaN
5 7 clownfish 1 dist1 2.0
6 7 bagtie 0 NaN NaN
7 7 birdie 0 NaN NaN
8 7 witchhat 0 NaN NaN
9 7 crocodile 1 dist1 4.0
10 7 honeybee 0 NaN NaN
11 7 electricitymeter 0 NaN NaN
12 7 flowerwreath 0 NaN NaN
13 7 jar03 1 dist2 1.0
14 7 camera02a 0 NaN NaN
As for the missing values, I would keep them as NaN
. pandas is pretty powerful in terms of working with missing data, so may as well take advantage of this.
Upvotes: 2