Python: How to remove a list of common words from company names?

Question

I have a few thousand common words such as LLC, INC, CO that I need to remove from the end of a few million company names in a pandas dataframe column. The following removes the common words in any position:

toexclude = dfwcomwords['ending'].tolist()

data['names'] = data['names'].apply(lambda x: ' '.join([word for word in x.split() if word not in (toexclude)]))

But I only want to remove the words from the end of the name, i.e. "INC INTERNATIONAL LLC" should be "INC INTERNATIONAL". (The above makes it "INTERNATIONAL".) Any help would be much appreciated.

Edit: Following @ba_ul's suggestion below, I get an "unbalanced parenthesis" error:

for word in toexclude:
    data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))

Traceback (most recent call last):

  File "", line 2, in 
    data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))

  File "/anaconda3/envs/pandas/lib/python3.7/site-packages/pandas/core/series.py", line 4042, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)

  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer

  File "", line 2, in 
    data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))

  File "/anaconda3/envs/pandas/lib/python3.7/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)

  File "/anaconda3/envs/pandas/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)

  File "/anaconda3/envs/pandas/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)

  File "/anaconda3/envs/pandas/lib/python3.7/sre_parse.py", line 944, in parse
    raise source.error("unbalanced parenthesis")

error: unbalanced parenthesis

ba_ul · Accepted Answer

You can check each word against two conditions: (1) whether it's in toexclude and (2) whether it's the last word in the company name.

toexclude = dfwcomwords['ending'].tolist()

def remove_suffix(x):
    x_list = x.split()
    return ' '.join([word for index, word in enumerate(x_list) if not (word in toexclude and index == len(x_list) - 1)])

data['names'] = data['names'].apply(remove_suffix)
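
With a few thousand words and a few million names, membership tests against a list can get slow. A minimal sketch of the same last-word check, assuming the words are loaded into a set and using rsplit so only the final word is split off (the sample words and names are made up for illustration):

```python
def remove_suffix(name, suffixes):
    """Drop the last whitespace-separated word of `name` if it is in `suffixes`."""
    # rsplit(None, 1) splits off only the final word and ignores trailing whitespace.
    parts = name.rsplit(None, 1)
    if parts and parts[-1] in suffixes:
        # A name that consists only of a suffix becomes an empty string.
        return parts[0] if len(parts) == 2 else ""
    return name


# Sample values, assumed for illustration; in the question these come from
# dfwcomwords['ending'].
toexclude = {"LLC", "INC", "CO"}
names = ["INC INTERNATIONAL LLC", "ACME CO", "LLC"]

print([remove_suffix(n, toexclude) for n in names])
# ['INC INTERNATIONAL', 'ACME', '']
```

It plugs into the dataframe the same way, e.g. data['names'].apply(lambda x: remove_suffix(x, toexclude)). Unlike the split/join version above, names that do not end in a listed word are returned unchanged, including any repeated internal spaces.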

Edit: For suffixes containing spaces, you can remove them first by using regex and the str.replace function of pandas.

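# Escape the dots so they match literal periods, and pass regex=True explicitly
# (newer pandas versions treat the pattern as a plain string by default)
data['names'] = data['names'].str.replace(r'S\. A\. R\. L\.$', '', regex=True)

# If you have multiple such unusual suffixes, you can chain all of them together
data['names'] = data['names'].str.replace(r'S\. A\. R\. L\.$', '', regex=True).str.replace(r'L L C$', '', regex=True)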

$ in the regex ensures you remove only the occurrences that are at the end of a name.
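
To see what the anchor and the dots do, here is a small sketch using the re module directly. The helper name strip_ending and the sample names are made up for illustration. An unescaped "." matches any character, so the dots need a backslash if they should only match literal periods:

```python
import re

loose = r'S. A. R. L.$'        # "." matches any single character
strict = r'S\. A\. R\. L\.$'   # "\." matches a literal period only


def strip_ending(pattern, name):
    """Remove `pattern` from `name` and trim the whitespace left behind."""
    return re.sub(pattern, '', name).rstrip()


print(strip_ending(strict, "ACME S. A. R. L."))     # ACME
print(strip_ending(loose, "ACME SA AB RC LD"))      # ACME  (unintended match)
print(strip_ending(strict, "ACME SA AB RC LD"))     # ACME SA AB RC LD

# "$" anchors the match to the end, so the leading "L L C" is left alone.
print(strip_ending(r'L L C$', "L L C HOLDINGS L L C"))  # L L C HOLDINGS
```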

Edit #2: Based on the new comments, a pure regex solution would probably be best. It's just three lines and should cover all cases.

import re

for word in toexclude:
    data['names'] = data['names'].apply(lambda x: re.sub(r'\b{}$'.format(re.escape(word)), '', x, flags=re.IGNORECASE))
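
The re.escape call is what avoids errors like the one in the question: if an entry contains a regex metacharacter (a stray ")" is assumed here as an example), the unescaped pattern fails to compile. Looping over thousands of words also means thousands of passes over the column. A possible alternative, sketched under the assumption that every entry should match as a whole trailing token, is to join all escaped words into one pattern and apply it once. The function name and sample data below are hypothetical:

```python
import re


def build_suffix_pattern(words):
    """Compile one pattern matching any of `words` at the end of a string."""
    escaped = [re.escape(w) for w in words]
    # (?:^|\s+) requires the suffix to start the string or follow whitespace,
    # so "CO" does not match inside "TELCO". \s*$ tolerates trailing spaces.
    return re.compile(r'(?:^|\s+)(?:' + '|'.join(escaped) + r')\s*$', re.IGNORECASE)


# An unescaped entry with a stray ")" cannot be compiled.
try:
    re.compile(r'INC)$')
except re.error as exc:
    print("unescaped pattern failed:", exc)

# Sample values, assumed for illustration.
toexclude = ["LLC", "INC", "CO", "S. A. R. L.", "INC)"]
pattern = build_suffix_pattern(toexclude)

names = ["INC INTERNATIONAL LLC", "Acme co", "FOO S. A. R. L.", "TELCO"]
cleaned = [pattern.sub('', n) for n in names]
print(cleaned)
# ['INC INTERNATIONAL', 'Acme', 'FOO', 'TELCO']
```

On the dataframe, the compiled pattern should be usable as data['names'].str.replace(pattern, '', regex=True), which makes a single pass over the column. Note that this removes the preceding whitespace along with the suffix, whereas the loop above leaves a trailing space behind.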
