lbug
lbug

Reputation: 371

Pandas if statements

I want to automate a data validation process using Pandas. This involves two steps:

The first is making sure that each row has a valid value for a column. For example, I want to check that in the column 'Independent' that row has either a 1 or a 0, and that it throws a flag if it does not. So if Independent==1 or Independent==0 IndepFlag=0 else IndepFlag=1 (For that row). Based on this flag, I want to print that Row's ID to an excel file.

The second is doing something similar, but checking multiple columns (e.g. if Independent==1 and Column2>1 Column2Flag=1.

My question is, how do I check for a valid value and then capture and print from the column that has the row's ID, and how do I do this with multiple columns?

Upvotes: 0

Views: 632

Answers (1)

Brian from QuantRocket
Brian from QuantRocket

Reputation: 5493

You can accomplish this using where: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-where-method-and-masking

In [22]: df = pd.DataFrame(dict(Independent=[-1,0,1,4,0], Column2=[0,1,0,2,2]))

In [23]: df
Out[23]:
   Column2  Independent
0        0           -1
1        1            0
2        0            1
3        2            4
4        2            0

where converts values that don't meet your criteria to nan. You can use notnull() to get a boolean Series and change it to ints if you like using astype:

In [24]: df["IndepFlag"] = df.Independent.where((df.Independent == 0) | (df.Independent == 1)).notnull().astype(int)

In [25]: df
Out[25]:
   Column2  Independent  IndepFlag
0        0           -1          0
1        1            0          1
2        0            1          1
3        2            4          0
4        2            0          1

Then you can select only the rows that interest you and write them to an excel file:

In [26]: flagged = df[df.IndepFlag == 1]

In [27]: flagged
Out[27]:
   Column2  Independent  IndepFlag
1        1            0          1
2        0            1          1
4        2            0          1

In [28]: flagged.to_excel("flagged.xlsx")

Using multiple columns is not very different, just change where to use (df.Independent == 0) & (df.Column2 > 1).

Upvotes: 1

Related Questions