Delete words with regex patterns in Python from a dataframe

Question

I'm playing around with regular expression in Python for the below data.

     Random
0  helloooo
1    hahaha
2     kebab
3      shsh
4     title
5      miss
6      were
7    laptop
8   welcome
9    pencil

I would like to delete the words which have patterns of repeated letters (e.g. blaaaa), repeated pair of letters (e.g. hahaha) and any words which have the same adjacent letters around one letter (e.g.title, kebab, were).

Here is the code:

import pandas as pd

data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}

df = pd.DataFrame(data)

df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(1)].reset_index(drop=True)

print(df)

Below is the output for the above with a Warning message:

UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
    Random
0   hahaha
1    kebab
2     shsh
3    title
4     were
5   laptop
6  welcome
7   pencil

However, I expect to see this:

    Random
0   laptop
1  welcome
2   pencil

Wiktor Stribiżew · Accepted Answer

You can use Series.str.contains directly to create a mask and disable the user warning before and enable it after:

import pandas as pd
import warnings

data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
warnings.filterwarnings("ignore", 'This pattern has match groups') # Disable the warning
df['Random'] = df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
warnings.filterwarnings("always", 'This pattern has match groups') # Enable the warning

Output:

>>> df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
# =>     
7     laptop
8    welcome
9     pencil
Name: Random, dtype: object

The regex you have contains an issue: the quantifier is put outside of the group, and \1 was looking for the wrong repeated string. Also, the \b word boundary is superflous. The ([a-z]+)[a-z]?\1 pattern matches for one or more letters, then any one optional letter, and the same substring right after it.

See the regex demo.

We can safely disable the user warning because we deliberately use the capturing group here, as we need to use a backreference in this regex pattern. The warning needs re-enabling to avoid using capturing groups in other parts of our code where it is not necessary.

Delete words with regex patterns in Python from a dataframe

Answers (2)

Related Questions