Reputation: 693
I have the following dataset:
d1 = {'Indiv1':['Subject1','Subject2','Subject1','Subject1','Subject2','Subject1','Subject1','Subject2'],
'Indiv2': ['Subject4','Subject3','Subject2','Subject4','Subject4','Subject2','Subject3','Subject3'],
'Event':['1','1','2','2','2','3','3','3'],
'Category':['1','2','1','1','1','2','2','2'],
'Variable1':['1','2','3','4','5','6','7','8'],
'Variable2':['12','11','10','9','8','7','6','5'],
'Variable3': ['-4','-3','-2','-1','0','1','2','3']}
d1 = pd.DataFrame(d1)
d1=d1[['Indiv1','Indiv2','Event','Category','Variable1','Variable2','Variable3']]
d1
Which gives the following (In my dataset this particular file has over 2 million rows):
d1=
I have a second smaller dataset (about 1500 rows) in the following format:
d2 = {'Indiv1': ['Subject1','Subject3','Subject1','Subject4','Subject2','Subject1','Subject1','Subject2'],
'Indiv2': ['Subject4','Subject2','Subject6','Subject1','Subject1','Subject8','Subject9','Subject113'],
'Event':['1','1','2','2','2','3','3','3'],
'Category':['1','2','1','1','1','2','2','2']}
d2 = pd.DataFrame(d2)
d2=d2[['Indiv1','Indiv2','Event','Category']]
d2
which looks like the following:
d2=
what I need to do is to look up the pairs of subjects in each event in each category from the second file (d2). If the pair exixts in both d1 and d2 for the given event id and category, then assign 1 to the row in df1. Otherwise assign 0.
Note that in row 2 of df2, the order of the two individuals are flipped. Instead of Subject 2 and Subject 3 as in d1, in d2 we have Subject 3 and Subject 2. However, in my case I want to treat both as the same. In this case, I want to assign a value of 1 to these cases.
Finally there are pairs in d1 which are not in d2 (for each event, for each category). For example, for event 3, there is no pair with Subject1 and Subject 2 in df2 (although this exists in df1). In such a case, assign a value =0 under the column present. The final output will look like this:
I am especially very perplexed with how to do it. Especially in the case, when groupings may change (Subject1 and Subject 2) vs (Subject2 and Subject 1).
Any help will be appreciated. Ty in advance.
Please let me know if I amam unclear
Upvotes: 1
Views: 90
Reputation: 323226
Creating a key by using indiv1 and indiv2 with np.sort
, then using isin
(drop it by using d1.drop('key',1 inplace=True)
)
d2['key']=np.sort(d2.iloc[:,:2],axis=1).sum(1)+d2.Event.astype(str)
d1['key']=np.sort(d1.iloc[:,:2],axis=1).sum(1)+d1.Event.astype(str)
d1['persent']=d1.key.isin(d2.key).astype(int)
d1
Out[39]:
Indiv1 Indiv2 Event Category Variable1 Variable2 Variable3 \
0 Subject1 Subject4 1 1 1 12 -4
1 Subject2 Subject3 1 2 2 11 -3
2 Subject1 Subject2 2 1 3 10 -2
3 Subject1 Subject4 2 1 4 9 -1
4 Subject2 Subject4 2 1 5 8 0
5 Subject1 Subject2 3 2 6 7 1
6 Subject1 Subject3 3 2 7 6 2
7 Subject2 Subject3 3 2 8 5 3
key persent
0 Subject1Subject41 1
1 Subject2Subject31 1
2 Subject1Subject22 1
3 Subject1Subject42 1
4 Subject2Subject42 0
5 Subject1Subject23 0
6 Subject1Subject33 0
7 Subject2Subject33 0
Upvotes: 1