Reputation: 371
I want to automate a data validation process using Pandas. This involves two steps:
The first is making sure that each row has a valid value for a column. For example, I want to check that in the column 'Independent' that row has either a 1 or a 0, and that it throws a flag if it does not. So if Independent==1 or Independent==0 IndepFlag=0 else IndepFlag=1 (For that row). Based on this flag, I want to print that Row's ID to an excel file.
The second is doing something similar, but checking multiple columns (e.g. if Independent==1 and Column2>1 Column2Flag=1.
My question is, how do I check for a valid value and then capture and print from the column that has the row's ID, and how do I do this with multiple columns?
Upvotes: 0
Views: 632
Reputation: 5493
You can accomplish this using where
: http://pandas.pydata.org/pandas-docs/dev/indexing.html#the-where-method-and-masking
In [22]: df = pd.DataFrame(dict(Independent=[-1,0,1,4,0], Column2=[0,1,0,2,2]))
In [23]: df
Out[23]:
Column2 Independent
0 0 -1
1 1 0
2 0 1
3 2 4
4 2 0
where
converts values that don't meet your criteria to nan
. You can use notnull()
to get a boolean Series and change it to ints if you like using astype
:
In [24]: df["IndepFlag"] = df.Independent.where((df.Independent == 0) | (df.Independent == 1)).notnull().astype(int)
In [25]: df
Out[25]:
Column2 Independent IndepFlag
0 0 -1 0
1 1 0 1
2 0 1 1
3 2 4 0
4 2 0 1
Then you can select only the rows that interest you and write them to an excel file:
In [26]: flagged = df[df.IndepFlag == 1]
In [27]: flagged
Out[27]:
Column2 Independent IndepFlag
1 1 0 1
2 0 1 1
4 2 0 1
In [28]: flagged.to_excel("flagged.xlsx")
Using multiple columns is not very different, just change where
to use (df.Independent == 0) & (df.Column2 > 1)
.
Upvotes: 1