Reputation: 2765
I have a dataframe with a column ClientAccount
which contains a lot of test data data which I want to filter out.
To find how many rows contains test clients, I do the following:
test_users = order_data[order_data['ClientAccount'].str.contains("DEMO|test")==True]
Which returns Name: ClientAccount, Length: 2493
Cool, so 2.493 rows out of 71.458 original rows.
Then to get get everything that isn't these 2.493 rows, shouldn't I just do the opposite?
order_data = order_data[order_data['ClientAccount'].str.contains("DEMO|test")==False]
This gives 48.046 rows though, but how does that make sense? What am I missing?
Upvotes: 1
Views: 110
Reputation: 862771
I think there are NaN
or None
values, so is possible use parameter na
in str.contains
. Also for inversing boolean mask (True
+ False
Series
) use ~
:
mask = order_data['ClientAccount'].str.contains("DEMO|test", na=False)
test_users1 = order_data[mask]
test_users2 = order_data[~mask]
Sample:
order_data = pd.DataFrame({'ClientAccount':['DEMO ss','test f','dfd', None, np.nan, 'test']})
print (order_data)
ClientAccount
0 DEMO ss
1 test f
2 dfd
3 None
4 NaN
5 test
mask = order_data['ClientAccount'].str.contains("DEMO|test", na=False)
test_users1 = order_data[mask]
test_users2 = order_data[~mask]
print (test_users1)
ClientAccount
0 DEMO ss
1 test f
5 test
print (test_users2)
ClientAccount
2 dfd
3 None
4 NaN
Without parameter I get error:
mask = order_data['ClientAccount'].str.contains("DEMO|test")
ValueError: cannot index with vector containing NA / NaN values
Upvotes: 1