Reputation: 59
I have a data frame with some text read in from a txt file. The column names are FEATURE and SENTENCES.
Within the FEATURE column there is some text that starts with '[NA]', e.g. '[NA] not a feature'.
How can I remove those rows from my data frame?
So far I have tried:
df[~df.FEATURE.str.contains("[NA]")]
But this did nothing, no errors either.
I also tried:
df.drop(df['FEATURE'].str.startswith('[NA]'))
Again, there were no errors, but this didn't work.
Upvotes: 1
Views: 2874
Reputation: 1
The simple code below should work:
df = df[~df['FEATURE'].str.startswith('[NA]')]
Upvotes: 0
Reputation: 862611
IIUC, use regex=False so that the string is not parsed as a regex:
df[~df.FEATURE.str.contains("[NA]", regex=False)]
Or escape the special regex characters [ and ] (using a raw string so the backslashes are preserved):
df[~df.FEATURE.str.contains(r"\[NA\]")]
Another problem could be leading or trailing whitespace; in that case use:
df[~df['FEATURE'].str.strip().str.startswith('[NA]')]
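To see why the unescaped pattern misbehaves: in a regex, [NA] is a character class that matches any single N or A, not the literal text. A minimal sketch, assuming a small made-up FEATURE column (the sample rows are illustrative, not taken from the question):

```python
import pandas as pd

# Hypothetical sample data, made up for illustration.
df = pd.DataFrame({"FEATURE": ["size", "Not Available", "[NA] not a feature"]})

# Default regex=True: "[NA]" is a character class (any 'N' or 'A').
as_regex = df["FEATURE"].str.contains("[NA]")

# regex=False: "[NA]" is matched as literal text.
as_literal = df["FEATURE"].str.contains("[NA]", regex=False)

# Keep only the rows that do not contain the literal "[NA]".
cleaned = df[~as_literal]
print(cleaned)
```

In this sketch the regex version also flags "Not Available", because that value contains an uppercase N and A, while the literal version flags only the last row.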
Upvotes: 1
Reputation: 8816
Let's suppose you have the DataFrame below:
>>> df
FEATURE
0 this
1 is
2 string
3 [NA]
Then the following should suffice:
>>> df[~df['FEATURE'].str.startswith('[NA]')]
FEATURE
0 this
1 is
2 string
Another way, in case the data needs to be converted to string before operating on it:
df[~df['FEATURE'].astype(str).str.startswith('[NA]')]
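Or using str.contains (pass regex=False so that [NA] is matched literally rather than as a regex character class):
>>> df[df.FEATURE.str.contains('[NA]', regex=False) == False]
# df[df['FEATURE'].str.contains('[NA]', regex=False) == False]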
FEATURE
0 this
1 is
2 string
Or, checking only the first character:
df[df.FEATURE.str[0].ne('[')]
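If the concern is missing values in the column, one option is the na parameter of str.startswith, which sets the result used for missing entries. A minimal sketch, assuming a made-up column that contains a None (the data is hypothetical):

```python
import pandas as pd

# Hypothetical data with one missing value, for illustration only.
df = pd.DataFrame({"FEATURE": ["this", None, "[NA] not a feature", "string"]})

# na=False makes missing entries count as "does not start with [NA]",
# so the mask is fully boolean and safe to negate.
mask = df["FEATURE"].str.startswith("[NA]", na=False)

# Keep every row whose text does not start with "[NA]".
result = df[~mask]
print(result)
```

Here the row with the missing value is kept; pass na=True instead if such rows should be dropped as well.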
Upvotes: 2
Reputation: 249
df['data'].str.startswith('[NA]') and df['data'].str.contains('[NA]') both return a boolean (True/False) Series. drop expects index labels rather than a boolean mask, so in this case it is easiest to use loc.
Here is one solution with some example data. Note that I add '==False' to get all the rows that DON'T have [NA]:
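import pandas as pd
df = pd.DataFrame(['feature','feature2', 'feature3', '[NA] not feature', '[NA] not feature2'], columns=['data'])
# regex=False matches '[NA]' literally instead of as a regex character class
mask = df['data'].str.contains('[NA]', regex=False)==False
df.loc[mask]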
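If you would still rather use drop, it can be given the index labels of the matching rows instead of the boolean Series itself. A minimal sketch, assuming the same kind of example data (the variable names is_na, dropped and kept are made up for illustration):

```python
import pandas as pd

# Example data in the same shape as the answer's sample (illustrative only).
df = pd.DataFrame(
    ["feature", "feature2", "feature3", "[NA] not feature", "[NA] not feature2"],
    columns=["data"],
)

# Boolean Series: True where the text starts with the literal "[NA]".
is_na = df["data"].str.startswith("[NA]")

# drop expects index labels, so pass the labels of the flagged rows.
dropped = df.drop(df[is_na].index)

# Same result with boolean indexing via loc.
kept = df.loc[~is_na]
print(dropped)
```

Both approaches return a new DataFrame, so assign the result back (for example df = df.loc[~is_na]) if the original should be replaced.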
Upvotes: 0