Passive_coder
Passive_coder

Reputation: 103

Remove all alphanumeric words from a string using pandas

I have a pandas dataframe column with strings that look like

'2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC'

and I want to convert it to look like

'ABACUS LABS, INC'.

My piece code :

list1 = data_df['Vendor'].str.split()
print(list1)
excludeList = list()
for y in list1:
    if (any([x for x in y if x.isalpha()]) and any([x for x in y if x.isdigit()])) :
      excludeList.append(y)
    if y.isdigit() or len(y) == 1:
      excludeList.append(y)
resList = [x for x in list1 if x not in excludeList]
print(restList)

It however gives me an error for

'list' object has no attribute 'isdigit'

Can anyone help me how I can remove the alphanumeric words from the string and only retain the text part in my pandas dataframe column?

Upvotes: 3

Views: 1588

Answers (3)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 626896

You can use

data_df = pd.DataFrame({'Vendor': ['2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC', 'abc123 CAT LABS, INC']})
data_df['Vendor'].str.replace(r'^(?:[A-Za-z-]+\d|[\d-]+[A-Za-z])[\w-]*\s*', '', regex=True)
# => 0    ABACUS LABS, INC
#    1       CAT LABS, INC
#    Name: Vendor, dtype: object

See the regex demo.

Regex details

  • ^ - start of string
  • (?:[A-Za-z-]+\d|[\d-]+[A-Za-z]) - either one or more letters/dashes and then a digit or a one or more digits/dashes and then a letter
  • [\w-]* - zero or more word or - chars
  • \s* - zero or more whitespace chars.

Upvotes: 2

SeaBean
SeaBean

Reputation: 23217

Assuming you have column Text in dataframe df, you can try:

df2 = df['Text'].str.split().explode()
m = df2.str.contains(r'[A-Za-z]') & df2.str.contains(r'\d')
df_out = df2[~m].groupby(level=0).agg(' '.join)
df_out = df_out.to_frame(name='Text')

Explanation

We split the text into separate words then explode the list of words into multiple rows with one word in one row. Then we test whether the word contains any alpha character(s) and digit(s) by regex by using .str.contains() as follows:

.str.contains(r'[A-Za-z]')     # test any character in [A-Za-z] in string

and

.str.contains(r'\d')           # test any numeric digit in string

Then with the boolean mask m of the alpha and digit tests, we select only those row entries that does not contains both the alpha and digits by:

df2[~m]

Then, we assemble the filtered words (without alphanumeric words) back to a sentence by using

groupby(level=0).agg(' '.join)

Here, we group by level=0 which is the original row index before explode (i.e. original row number).


Demo

data = {'Text': ['2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC', 'abc123 CAT LABS, INC']}
df = pd.DataFrame(data)

                                     Text
0  2fvRE-Ku89lkRVJ44QQFN ABACUS LABS, INC
1                    abc123 CAT LABS, INC


df2 = df['Text'].str.split().explode()
m = df2.str.contains(r'[A-Za-z]') & df2.str.contains(r'\d') 
df_out = df2[~m].groupby(level=0).agg(' '.join)
df_out = df_out.to_frame(name='Text')


               Text
0  ABACUS LABS, INC
1     CAT LABS, INC

Edit

We can also simplify it as:

df2 = df['Text'].str.findall(r'\b(?!.*[A-Za-z]+.*\d+)(?!.*\d+.*[A-Za-z]+.*).+\b').str.join(' ').str.strip()

Explanation

Here the regex we use is still to adhere to the requirement of excluding alphanumeric word. Regex:

r'\b(?!.*[A-Za-z]+.*\d+)(?!.*\d+.*[A-Za-z]+.*).+\b'

Within the word boundary \b .... \b, we use 2 negative lookaheads to check against both alpha and numeric characters. We need 2 negative lookaheads instead of one because it is possible the alpha can appear before the digit or vice versa.

Upvotes: 2

sophros
sophros

Reputation: 16660

You can use regular expressions to ensure quick and elegant solution:

df2 = df['Text'].str.findall(r'((?<=\s)[a-zA-Z,]+(?=\s|$))').agg(' '.join)

Let's break it down:

  1. Regular expression that picks up only the words without digits.
  2. Per each value of df['Text'] extract list of matches of the regex.
  3. Aggregate each list using ' '.join function that concatenates the values in the list adding space in-between.

The regex is doing this:

  • to catch only "words" that are either at the beginning/end of the string there have to be used non-capturing lookbehind and lookaheads (before and after the letter catching group respectively).
  • lookahead will also stop at end of the string (instead of any white char).
  • the characters accepted in the "words" are defined as [a-zA-Z,] which allows letters lowercase and uppercase as well as a comma.

Performance

Comparing with @SeaBean solution the time difference on my machine is notable (per 2 million records dataframe):

  • mine: 6.6522 s
  • SeaBean's: 25.1773 s (3.79x slower)

There is also smaller memory impact of my solution vs. SeaBean's as he is creating additional temporary dataframe.

Upvotes: 3

Related Questions