brucezepplin

Reputation: 9752

Python: test whether a value in a pandas DataFrame is a member of a set named by another column

If I have the following CSV file, test.csv:

C01,45,A,R
C02,123,H,I

where I have defined the sets R and I as

R=set(['R','E','D','N','P','H','K'])
I=set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])

I want to be able to test if the string A is a member of set R (which is false) and if string H is a member of set I (which is true). I have tried to do this with the following script:

#!/usr/bin/env python
import pandas as pd

I=set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])
R=set(['R','E','D','N','P','H','K'])

with open('test.csv') as f:
    table = pd.read_table(f, sep=',', header=None, lineterminator='\n')
table[table.columns[3]].astype(str).isin(table[table.columns[4]].astype(str))

That is, I am trying to do the equivalent of A in R, or rather table.columns[3] in table.columns[4], and return TRUE or FALSE for each row of data.

The problem is that with the final line as written, both rows return TRUE. If I change the final line to

table[table.columns[3]].astype(str).isin(R)

Then I get

0   FALSE
1   TRUE

which is correct. It seems that I am not referencing the set name correctly when doing .isin(table[table.columns[3]].astype(str)): the column holds the name of the set as a string, not the set itself.

Any ideas?

Upvotes: 0

Views: 749

Answers (1)

juanpa.arrivillaga

Reputation: 95948

Starting with the following:

In [21]: df
Out[21]: 
     0    1  2  3
0  C01   45  A  R
1  C02  123  H  I

In [22]: R=set(['R','E','D','N','P','H','K'])
    ...: I=set(['I','H','G','F','A','C','L','M','P','Q','S','T','V','W','Y'])
    ...: 

You could do something like this:

In [23]: sets = {"R":R,"I":I}

In [24]: df.apply(lambda S: S[2] in sets[S[3]],axis=1)
Out[24]: 
0    False
1     True
dtype: bool

Fair warning: .apply with axis=1 calls the function once per row in Python, so it is slow and doesn't scale well to larger data. It is there for convenience and as a last resort.
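
If the row-wise call becomes a bottleneck, one possible alternative is a plain list comprehension over the two columns. This is only a sketch: it rebuilds the two-row frame from the question by hand (the variable name result is my own), and it still loops in Python, so it is not truly vectorized. It usually avoids much of the per-row overhead of building a Series for every row, though I have not benchmarked it here.

```python
import pandas as pd

# Rebuild the example frame from the question (integer column labels 0..3).
df = pd.DataFrame([["C01", 45, "A", "R"], ["C02", 123, "H", "I"]])

R = set(['R', 'E', 'D', 'N', 'P', 'H', 'K'])
I = set(['I', 'H', 'G', 'F', 'A', 'C', 'L', 'M', 'P', 'Q', 'S', 'T', 'V', 'W', 'Y'])

# Map each set's name (as it appears in column 3) to the set object itself.
sets = {"R": R, "I": I}

# Pair each value in column 2 with the set name in column 3,
# then look the set up by name and test membership.
result = pd.Series(
    [value in sets[name] for value, name in zip(df[2], df[3])],
    index=df.index,
)
print(result)
```

As with the .apply version, a set name in column 3 that is missing from the sets dict raises a KeyError, so every name that can appear in that column needs an entry.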

Upvotes: 0
