Khaine775
Khaine775

Reputation: 2765

Pandas column filtering on string gives unexpected results

I have a dataframe with a column ClientAccount which contains a lot of test data data which I want to filter out.

To find how many rows contains test clients, I do the following:

test_users = order_data[order_data['ClientAccount'].str.contains("DEMO|test")==True]

Which returns Name: ClientAccount, Length: 2493

Cool, so 2.493 rows out of 71.458 original rows.

Then to get get everything that isn't these 2.493 rows, shouldn't I just do the opposite?

order_data = order_data[order_data['ClientAccount'].str.contains("DEMO|test")==False]

This gives 48.046 rows though, but how does that make sense? What am I missing?

Upvotes: 1

Views: 110

Answers (1)

jezrael
jezrael

Reputation: 862771

I think there are NaN or None values, so is possible use parameter na in str.contains. Also for inversing boolean mask (True + False Series) use ~:

mask = order_data['ClientAccount'].str.contains("DEMO|test", na=False)

test_users1 = order_data[mask]
test_users2 = order_data[~mask]

Sample:

order_data = pd.DataFrame({'ClientAccount':['DEMO ss','test f','dfd', None, np.nan, 'test']})
print (order_data)
  ClientAccount
0       DEMO ss
1        test f
2           dfd
3          None
4           NaN
5          test

mask = order_data['ClientAccount'].str.contains("DEMO|test", na=False)

test_users1 = order_data[mask]
test_users2 = order_data[~mask]

print (test_users1)
  ClientAccount
0       DEMO ss
1        test f
5          test

print (test_users2)
  ClientAccount
2           dfd
3          None
4           NaN

Without parameter I get error:

mask = order_data['ClientAccount'].str.contains("DEMO|test")

ValueError: cannot index with vector containing NA / NaN values

Upvotes: 1

Related Questions