Reputation: 37
I'm playing around with regular expression in Python for the below data.
Random
0 helloooo
1 hahaha
2 kebab
3 shsh
4 title
5 miss
6 were
7 laptop
8 welcome
9 pencil
I would like to delete the words which have patterns of repeated letters (e.g. blaaaa), repeated pair of letters (e.g. hahaha) and any words which have the same adjacent letters around one letter (e.g.title, kebab, were).
Here is the code:
import pandas as pd
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(1)].reset_index(drop=True)
print(df)
Below is the output for the above with a Warning message:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
Random
0 hahaha
1 kebab
2 shsh
3 title
4 were
5 laptop
6 welcome
7 pencil
However, I expect to see this:
Random
0 laptop
1 welcome
2 pencil
Upvotes: 1
Views: 202
Reputation: 626699
You can use Series.str.contains
directly to create a mask and disable the user warning before and enable it after:
import pandas as pd
import warnings
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
warnings.filterwarnings("ignore", 'This pattern has match groups') # Disable the warning
df['Random'] = df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
warnings.filterwarnings("always", 'This pattern has match groups') # Enable the warning
Output:
>>> df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
# =>
7 laptop
8 welcome
9 pencil
Name: Random, dtype: object
The regex you have contains an issue: the quantifier is put outside of the group, and \1
was looking for the wrong repeated string. Also, the \b
word boundary is superflous. The ([a-z]+)[a-z]?\1
pattern matches for one or more letters, then any one optional letter, and the same substring right after it.
See the regex demo.
We can safely disable the user warning because we deliberately use the capturing group here, as we need to use a backreference in this regex pattern. The warning needs re-enabling to avoid using capturing groups in other parts of our code where it is not necessary.
Upvotes: 4
Reputation: 5802
IIUC, you can use sth like the pattern r'(\w+)(\w)?\1'
, i.e., one or more letters, an optional letter, and the letters from the first match. This gives the right result:
df[~df.Random.str.contains(r'(\w+)(\w)?\1')]
Upvotes: 2