chip

Reputation: 59

Remove a row in a pandas data frame if the data starts with a specific character

I have a data frame with some text read in from a txt file; the column names are FEATURE and SENTENCES. Within the FEATURE column, some values start with '[NA]', e.g. '[NA] not a feature'.

How can I remove those rows from my data frame?

So far I have tried:

df[~df.FEATURE.str.contains("[NA]")]

But this did nothing, and there were no errors either.

I also tried:

df.drop(df['FEATURE'].str.startswith('[NA]'))

Again, there were no errors, but this didn't work.

Upvotes: 1

Views: 2874

Answers (4)

Joyson Martinraj

Reputation: 1

The simple code below should work (using the FEATURE column from the question):

df = df[~df['FEATURE'].str.startswith('[NA]')]

Upvotes: 0

jezrael

Reputation: 862611

IIUC, use regex=False so that the pattern is not parsed as a regex:

df[~df.FEATURE.str.contains("[NA]", regex=False)]

Or escape the special regex characters [ and ] (in a raw string, so the backslashes reach the regex engine unchanged):

df[~df.FEATURE.str.contains(r"\[NA\]")]

Another problem could be leading or trailing whitespace; in that case strip it first:

df[~df['FEATURE'].str.strip().str.startswith('[NA]')]
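
To make the difference between these variants concrete, here is a minimal sketch. The sample rows are made up for illustration and are not the asker's data. As a regex, "[NA]" is a character class that matches any single N or A, so it can flag unrelated rows. Also note that boolean indexing returns a new frame, so the result has to be assigned to be kept.

```python
import pandas as pd

# Hypothetical sample data; the real FEATURE column comes from the asker's txt file.
df = pd.DataFrame(
    {"FEATURE": ["[NA] not a feature", "Audio quality", " [NA] padded", "battery life"]}
)

# As a regex, "[NA]" is a character class: it matches any single "N" or "A".
as_regex = df["FEATURE"].str.contains("[NA]")

# With regex=False the pattern is treated as the literal text "[NA]".
as_literal = df["FEATURE"].str.contains("[NA]", regex=False)

# Stripping first lets startswith catch rows with leading whitespace.
starts = df["FEATURE"].str.strip().str.startswith("[NA]")

# Boolean indexing returns a new frame, so assign the result to keep it.
cleaned = df[~starts]
print(cleaned)
```

In this sketch the regex version also flags "Audio quality" because of its capital A, while the literal and startswith versions only flag the two '[NA]' rows.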

Upvotes: 1

Karn Kumar

Reputation: 8816

Let's suppose you have the DataFrame below:

>>> df
  FEATURE
0    this
1      is
2  string
3    [NA]

Then the following should suffice:

>>> df[~df['FEATURE'].str.startswith('[NA]')]
  FEATURE
0    this
1      is
2  string

Another way, in case the data needs to be converted to string before operating on it:

df[~df['FEATURE'].astype(str).str.startswith('[NA]')]

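Or using str.contains (with regex=False, so that '[NA]' is matched literally rather than as a regex character class):

>>> df[df.FEATURE.str.contains('[NA]', regex=False) == False]
  # df[df['FEATURE'].str.contains('[NA]', regex=False) == False]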
  FEATURE
0    this
1      is
2  string

Or, checking only the first character:

df[df.FEATURE.str[0].ne('[')]
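
If the column may contain missing values, an alternative to astype(str) is the na parameter of str.startswith. This is a minimal sketch with made-up data; the None entry is an assumption for illustration and nothing in the question says the real column has missing values.

```python
import pandas as pd

# Hypothetical data with a missing value in the column.
df = pd.DataFrame({"FEATURE": ["this", None, "[NA] unused", "string"]})

# na=False makes missing entries count as "does not start with [NA]",
# so the mask is purely boolean and safe to negate with ~.
mask = df["FEATURE"].str.startswith("[NA]", na=False)

kept = df[~mask]
print(kept)
```

With na=False the row holding the missing value is kept and only the '[NA]' row is removed.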

Upvotes: 2

Daniel Redgate

Reputation: 249

df['data'].str.startswith('[NA]') and df['data'].str.contains('[NA]', regex=False) both return a boolean (True/False) Series. drop expects index labels rather than booleans, so in this case it is easiest to use 'loc'.

Here is one solution with some example data. Note that I add '== False' to get all the rows that DON'T contain [NA], and pass regex=False so that '[NA]' is matched literally instead of as a regex character class:

import pandas as pd

df = pd.DataFrame(['feature','feature2', 'feature3', '[NA] not feature', '[NA] not feature2'], columns=['data'])

mask = df['data'].str.contains('[NA]', regex=False) == False
df.loc[mask]
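
If you would rather stick with drop, as in the question, it needs the index labels of the rows to remove instead of the mask itself. A minimal sketch using the same example data (the names to_drop and result are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    ["feature", "feature2", "feature3", "[NA] not feature", "[NA] not feature2"],
    columns=["data"],
)

# Boolean mask: True for rows whose text starts with the literal "[NA]".
to_drop = df["data"].str.startswith("[NA]")

# drop() expects index labels, so select the matching rows and pass their index.
result = df.drop(df[to_drop].index)
print(result)
```

Like boolean indexing, drop returns a new frame by default, so the result is assigned to a variable here.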

Upvotes: 0
