Umar.H

Reputation: 23099

Remove all rows that meet regex condition

I'm trying to teach myself pandas and am playing around with different dtypes.

I have a df as follows:

import pandas as pd

df = pd.DataFrame({'ID':[0,2,"bike","cake"], 'Course':['Test','Math','Store','History'] })
print(df)
     ID   Course
0     0     Test
1     2     Math
2  bike    Store
3  cake  History

The dtype of ID is, of course, object. What I want to do is remove any rows from the df where the ID is a string rather than a number.

I thought this would be as simple as:

df.ID.filter(regex='[\w]*')

but this returns everything. Is there a surefire method for dealing with such things?

Upvotes: 7

Views: 7299

Answers (3)

user3483203

Reputation: 51155

Wen's answer is the correct (and fastest) way to solve this, but to explain why your regular expression doesn't work, you have to understand what \w means.

\w matches any word character, i.e. [a-zA-Z0-9_]. Because that class includes digits (and * also allows an empty match), your pattern matches everything. A valid regular expression approach would be:

df.loc[df.ID.astype(str).str.match(r'^\d+$')]

  ID Course
0  0   Test
1  2   Math
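
One detail worth checking with str.match is anchoring: it only anchors at the start of the string, so a pattern without a trailing $, such as \d+, would also accept a mixed value. The sketch below illustrates this with "2bike", a hypothetical ID that is not in the original data, and assumes pandas 1.1 or later for str.fullmatch:

```python
import pandas as pd

# "2bike" is a made-up value added only to show the difference.
ids = pd.Series([0, 2, "bike", "cake", "2bike"], dtype=object).astype(str)

# str.match anchors at the start only, so "2bike" passes.
starts_with_digit = ids.str.match(r"\d+")

# str.fullmatch requires the whole string to be digits.
all_digits = ids.str.fullmatch(r"\d+")

print(starts_with_digit.tolist())  # [True, True, False, False, True]
print(all_digits.tolist())         # [True, True, False, False, False]
```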

The second issue is your use of filter. It isn't filtering on the values of your ID column; it is filtering on the index labels. A valid solution using filter would be as follows:

df.set_index('ID').filter(regex=r'^\d+$', axis=0)

   Course
ID
0    Test
2    Math
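
To see that filter really does look at labels rather than values, here is a small sketch using the df from the question. The variable names by_label and label_two are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 2, "bike", "cake"],
                   'Course': ['Test', 'Math', 'Store', 'History']})

# The regex is tested against the index labels 0, 1, 2, 3 (as strings),
# so every row passes regardless of what ID contains.
by_label = df.ID.filter(regex=r'[\w]*')
print(by_label.tolist())  # [0, 2, 'bike', 'cake']

# Asking for the label "2" returns the row at index 2, whose ID is "bike",
# not the row whose ID value is 2.
label_two = df.ID.filter(regex=r'^2$')
print(label_two)
```

This is why moving ID into the index with set_index makes the regex apply to the ID values.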

Upvotes: 5

pault

Reputation: 43504

Another option is to convert the column to string and use str.match:

print(df[df['ID'].astype(str).str.match(r"^\d+$")])
#  Course ID
#0   Test  0
#1   Math  2

Your code does not work because, as stated in the docs for pandas.DataFrame.filter:

Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.

Upvotes: 5

BENY

Reputation: 323236

You can use pd.to_numeric with errors='coerce':

df[pd.to_numeric(df.ID,errors='coerce').notnull()]
Out[450]: 
  Course ID
0   Test  0
1   Math  2
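
Broken into steps, the one-liner above might look like the following sketch. The names numeric_id, mask, cleaned and typed are only for illustration, and the final cast to int is an optional extra that assumes the remaining IDs are whole numbers:

```python
import pandas as pd

df = pd.DataFrame({'ID': [0, 2, "bike", "cake"],
                   'Course': ['Test', 'Math', 'Store', 'History']})

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
numeric_id = pd.to_numeric(df['ID'], errors='coerce')

# Rows whose ID survived the conversion.
mask = numeric_id.notna()
cleaned = df[mask]

# Optional: ID is still object dtype in `cleaned`; store it as integers.
typed = cleaned.assign(ID=numeric_id[mask].astype(int))

print(typed)
print(typed.dtypes)
```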

Upvotes: 6
