Reputation: 165
Here is what I have:
import re
import pandas as pd
d = {'ID': [1, 2, 3, 4, 5], 'Desc': ['0*1***HHCM', 'HC:83*20', 'HC:5*2CASL', 'DM*72\nCAS*', 'HC:564*CAS*5']}
df = pd.DataFrame(data=d)
df
Output:
ID Desc
0 1 0*1***HHCM
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
I need to filter the dataframe by column "Desc", if it contains "CAS" or "HC" that are not surrounded by letters or digits.
Here is what I tried:
new_df = df[df['Desc'].str.match(r'[^A-Za-z0-9]CAS[^A-Za-z0-9]|[^A-Za-z0-9]HC[^A-Za-z0-9]') == True]
It returns an empty dataframe.
I want it to return the following:
ID Desc
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
Another thing: since 3rd row has "\nCas", where "\n" is a line separator, will it treat it as a letter before "CAS"?
Please help.
Upvotes: 1
Views: 2057
Reputation: 13888
Try this:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=re.M)]
# If you don't want to import re you can just use flags=8:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=8)]
Result:
ID Desc
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
To answer your other question, as long as \n
is passed correctly it will be parsed as a newline character instead of an alphanumeric character n
. i.e.:
r'\n' -> `\\n` (backslash character + n character)
'\n' -> '\n' (newline character)
For further explanation on the regex, please see Regex101 demo: https://regex101.com/r/FNBgPV/2
Upvotes: 1
Reputation: 7604
You can try this, it checks only the numbers and letters before CAS and HC, but you can easily modify it to after also:
print(df[~df['Desc'].str.contains('([0-9a-zA-Z]+CAS*)|([0-9a-zA-Z]+HC*)', regex=True)])
ID Desc
1 2 HC:83*20
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
Upvotes: 0