Reputation: 3657
I have a CSV that I'm loading into a dataframe. I only need the rows for which the Organization column contains a target string, affiliation.
When I try to use str.contains() I get ValueError: cannot index with vector containing NA / NaN values.
I've looked at Value Error when Slicing in Pandas and pandas + dataframe - select by partial string, and either of the solutions those posts suggest works for me:
df = df[df['Organization'].str.contains(affiliation)==True]
or
df = df[df['Organization'].str.contains(affiliation).fillna(False)]
But, as a test, I did this:
print(len(df)) #99228
df = df[pd.notnull(df['Organization'])] #or df = df.dropna(subset=['Organization'])
print(len(df)) #99228
df = df[df['Organization'].str.contains(affiliation).fillna(False)]
print(len(df)) #1605
My question is: the ValueError I was getting without ==True or fillna(False) attached to str.contains() seems to imply that the Organization column has NaNs. But then why do I get the same-sized df after keeping only the non-null Organization rows? What am I missing here?
Thanks!
Upvotes: 0
Views: 271
Reputation: 81
You need to specify str.contains('affiliation', na=False). [docs]
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: s = pd.Series(['foo','bar',np.nan,'this contains affiliation','baz',np.nan])
In [4]: s.str.contains('affiliation')
Out[4]:
0 False
1 False
2 NaN
3 True
4 False
5 NaN
dtype: object
In [5]: s.str.contains('affiliation', na=False)
Out[5]:
0 False
1 False
2 False
3 True
4 False
5 False
dtype: bool
Then you can index your DataFrame with that Boolean array.
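For example, on the question's dataframe that could look something like this (a minimal sketch with a made-up df and affiliation, since the real CSV isn't shown):
import pandas as pd

# toy stand-in for the question's data: strings, a missing value, and a number
df = pd.DataFrame({'Organization': ['Acme Corp', None, 'Dept of Affiliation Studies', 42]})
affiliation = 'Affiliation'

mask = df['Organization'].str.contains(affiliation, na=False)  # NaN results become False
print(df[mask])  # keeps only the row whose Organization actually matches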
Upvotes: 1
Reputation: 85482
Check the content of your column Organization. It very likely contains strings and other data types (numbers, for example). Therefore, df['Organization'].str.contains(affiliation) results in NaN for the values with other data types. Those values are not null themselves, which is why pd.notnull()/dropna() leaves your row count unchanged; the NaN only shows up in the result of str.contains(). You cannot use NaN for indexing, so you need to convert it into False.
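A quick way to see this (a sketch with invented data, assuming a non-string value such as a number in the column):
import numpy as np
import pandas as pd

s = pd.Series(['foo affiliation', 42, np.nan])  # 42 is not a string, but it is not null either

print(s.notnull())                              # True, True, False -> dropna() would keep the 42
print(s.str.contains('affiliation'))            # True, NaN, NaN    -> non-string rows also become NaN
print(s.str.contains('affiliation', na=False))  # True, False, False -> safe for indexing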
Upvotes: 1