Prometheus
Prometheus

Reputation: 693

Python find pairs when from subset (pair numbers are changing)

I have the following dataset:

d1 = {'Indiv1':['Subject1','Subject2','Subject1','Subject1','Subject2','Subject1','Subject1','Subject2'],
'Indiv2': ['Subject4','Subject3','Subject2','Subject4','Subject4','Subject2','Subject3','Subject3'],
'Event':['1','1','2','2','2','3','3','3'],
'Category':['1','2','1','1','1','2','2','2'],
'Variable1':['1','2','3','4','5','6','7','8'],
'Variable2':['12','11','10','9','8','7','6','5'],
'Variable3': ['-4','-3','-2','-1','0','1','2','3']}
d1 = pd.DataFrame(d1)  
d1=d1[['Indiv1','Indiv2','Event','Category','Variable1','Variable2','Variable3']]
d1

Which gives the following (In my dataset this particular file has over 2 million rows):

d1=

enter image description here

I have a second smaller dataset (about 1500 rows) in the following format:

d2 = {'Indiv1': ['Subject1','Subject3','Subject1','Subject4','Subject2','Subject1','Subject1','Subject2'],
'Indiv2': ['Subject4','Subject2','Subject6','Subject1','Subject1','Subject8','Subject9','Subject113'],
'Event':['1','1','2','2','2','3','3','3'],
'Category':['1','2','1','1','1','2','2','2']}
d2 = pd.DataFrame(d2)
d2=d2[['Indiv1','Indiv2','Event','Category']]
d2

which looks like the following:

d2=

enter image description here

what I need to do is to look up the pairs of subjects in each event in each category from the second file (d2). If the pair exixts in both d1 and d2 for the given event id and category, then assign 1 to the row in df1. Otherwise assign 0.

enter image description here

Note that in row 2 of df2, the order of the two individuals are flipped. Instead of Subject 2 and Subject 3 as in d1, in d2 we have Subject 3 and Subject 2. However, in my case I want to treat both as the same. In this case, I want to assign a value of 1 to these cases.

enter image description here

Finally there are pairs in d1 which are not in d2 (for each event, for each category). For example, for event 3, there is no pair with Subject1 and Subject 2 in df2 (although this exists in df1). In such a case, assign a value =0 under the column present. The final output will look like this:

enter image description here

I am especially very perplexed with how to do it. Especially in the case, when groupings may change (Subject1 and Subject 2) vs (Subject2 and Subject 1).

Any help will be appreciated. Ty in advance.

Please let me know if I amam unclear

Upvotes: 1

Views: 90

Answers (1)

BENY
BENY

Reputation: 323226

Creating a key by using indiv1 and indiv2 with np.sort, then using isin(drop it by using d1.drop('key',1 inplace=True))

d2['key']=np.sort(d2.iloc[:,:2],axis=1).sum(1)+d2.Event.astype(str)
d1['key']=np.sort(d1.iloc[:,:2],axis=1).sum(1)+d1.Event.astype(str)
d1['persent']=d1.key.isin(d2.key).astype(int)
d1
Out[39]: 
     Indiv1    Indiv2 Event Category Variable1 Variable2 Variable3  \
0  Subject1  Subject4     1        1         1        12        -4   
1  Subject2  Subject3     1        2         2        11        -3   
2  Subject1  Subject2     2        1         3        10        -2   
3  Subject1  Subject4     2        1         4         9        -1   
4  Subject2  Subject4     2        1         5         8         0   
5  Subject1  Subject2     3        2         6         7         1   
6  Subject1  Subject3     3        2         7         6         2   
7  Subject2  Subject3     3        2         8         5         3   
                 key  persent  
0  Subject1Subject41        1  
1  Subject2Subject31        1  
2  Subject1Subject22        1  
3  Subject1Subject42        1  
4  Subject2Subject42        0  
5  Subject1Subject23        0  
6  Subject1Subject33        0  
7  Subject2Subject33        0  

Upvotes: 1

Related Questions