Ward Nieboer
Ward Nieboer

Reputation: 15

Pandas dataframe compare multiple rows with specific condition

I hope I can explain my problem correctly.. I have a dataframe (306x40) with multiple rows containing data of certain group, I need to group them by index, that's no problem. Next I need to compare the rows with another row that has a specific condition. Take this data as an example:

Id  Condition   var1   var2   var3
1      1         0      1      0
1      3         1      1      0
2      2         0      0      1
2      3         0      0      1
2      1         0      0      1

So I need to compare var1, var2, var3 per Id with the row that has condition 3.. The outcome would let me know that any of the variables is different than the variables in condition 3. Possible outputs:

Id  Condition   var1   var2   var3  count_false
1      1        false  true   true      1
1      3        NaN    NaN    NaN       NaN
2      2        true   true   true      0
2      3        NaN    NaN    NaN       NaN
2      1        true   true   true      0

Or just simply saying that condition x is not the same as condition 3 for y Id

I hope you guys understand what I need, but I'm happy to elaborate

Thank you so much in advance!

edit to make it more clear:

The data consists of data from 3 different methods to detect bacteria in patients (clinical data), note that per patient it differs which method was used and how many times. So I have method 1, 2, 3, these are the different conditions. The variables are the different kinds of bacteria found. Method 1 and 2 are the golden standard and method 3 is the one that needs to be validated. So I want to see if method 3 gives the same result as method 1 or 2. the 30+ variables are the bacteria and value 1 indicates present bacteria and 0 not present.

Upvotes: 0

Views: 2025

Answers (2)

user3774410
user3774410

Reputation: 16

#x!=x.loc[i] gives you the truth value of each row when comparing with your desired(condition) row. Rest of the code is below:

df=pd.DataFrame([[1,1,0,1,0],[1,3,1,1,0],[2,2, 0,0,1],[2,3, 0,0,1],[2,1, 0,0,1]],columns=['ID','Condition','var1','var2','var3'])
def comp(x):
    i=x.index[x['Condition']==3][0]
    #print(x!=x.loc[i])
    truth=(x[['var1','var2','var3']]==x.loc[i,['var1','var2','var3']])
    x=pd.concat([x[['ID','Condition']],truth],axis=1)
    return x
grp=df.groupby('ID').apply(comp)
grp['count_false']=grp[['var1','var2','var3']].apply(lambda x: 3-x.sum(),axis=1)
print(grp)

Upvotes: 0

Pierre D
Pierre D

Reputation: 26221

If I understand correctly, this is just about counting how many var1 .. var2 are different than var3, row by row. There is no grouping involved.

If that's the case (happy to modify if not), then:

cols = 'var1 var2 var3'.split()  # can be extended to more variables
df['count_diff'] = (df[cols[:-1]].values != df[cols[-1]].values[:, None]).sum(axis=1)

On a modified example:

df = pd.DataFrame({
    'Id': [1, 1, 2, 2, 2],
    'Condition': [1, 3, 2, 3, 1],
    'var1': [0, 1, 0, 1, 0],
    'var2': [1, 1, 0, 1, 0],
    'var3': [0, 0, 1, 1, 1]})

cols = 'var1 var2 var3'.split()
df['count_diff'] = (df[cols[:-1]].values != df[cols[-1]].values[:, None]).sum(axis=1)
print(df)

# output:
   Id  Condition  var1  var2  var3  count_diff
0   1          1     0     1     0           1
1   1          3     1     1     0           2
2   2          2     0     0     1           2
3   2          3     1     1     1           0
4   2          1     0     0     1           2

Upvotes: 1

Related Questions