owwoow14
owwoow14

Reputation: 1754

Filter all rows that do not contain letters (alpha) in ´pandas´

I am trying to filter a pandas dataframe using regular expressions. I want to delete those rows that do not contain any letters. For example:

Col A.
50000
$927848
dog
cat 583
rabbit 444

My desired results is:

Col A.
dog
cat 583
rabbit 444

I have been trying to solve this problem unsuccessful with regex and pandas filter options. See blow. I am specifically running into problems when I try to merge two conditions for the filter. How can I achieve this?

Option 1:

df['Col A.'] = ~df['Col A.'].filter(regex='\d+')

Option 2

df['Col A.'] = df['Col A.'].filter(regex=\w+)

Option 3

from string import digits, letters
df['Col A.'] = (df['Col A.'].filter(regex='|'.join(letters)))

OR

df['Col A.'] = ~(df['Col A.'].filter(regex='|'.join(digits)))

OR

df['Col A.'] = df[~(df['Col A.'].filter(regex='|'.join(digits))) & (df['Col A.'].filter(regex='|'.join(letters)))]

Upvotes: 8

Views: 20925

Answers (4)

jezrael
jezrael

Reputation: 862921

I think you'd need str.contains to filter values which contain letters by the means of boolean indexing:

df =  df[df['Col A.'].str.contains('[A-Za-z]')]
print (df)
       Col A.
2         dog
3     cat 583
4  rabbit 444

If there are some NaNs values you can pass a parameter:

df = df[df['Col A.'].str.contains('[A-Za-z]', na=False)]    
print (df)
       Col A.
3         dog
4     cat 583
5  rabbit 444

Upvotes: 25

Vaidic
Vaidic

Reputation: 21

df['Col A.'].str.contains(r'^\d+$', na=True) # if string with only digits or if int/float then will result in NaN converted to True

eg: [50000, '$927848', 'dog', 'cat 583', 'rabbit 444', '3 e 3', 'e 3', '33', '3 e'] will give : [True,False,False,False,False,False,False, True,False]

Upvotes: 2

Benoît Zu
Benoît Zu

Reputation: 1287

You can use ^.*[a-zA-Z].*$

https://regex101.com/r/b84ji1/1

Details

^: Start of the line

.*: Match any character

[a-zA-Z]: Match letters

$: End of the line

Upvotes: 1

Samuel GIFFARD
Samuel GIFFARD

Reputation: 842

Have you tried:

df['Col A.'].filter(regex=r'\D')  # Keeps only if there's a non-digit character

or:

df['Col A.'].filter(regex=r'[A-Za-z]')  # Keeps only if there's a letter (alpha)

or:

df['Col A.'].filter(regex=r'[^\W\d_]')  # More info in the link below...

Explanation: https://stackoverflow.com/a/2039476/8933502

Upvotes: 3

Related Questions