pandini

Reputation: 69

Python: How to remove a list of common words from company names?

I have a few thousand common words such as LLC, INC, CO that I need to remove from the end of a few million company names in a pandas dataframe column. The following removes the common words in any position:

toexclude = dfwcomwords['ending'].tolist()

data['names'] = data['names'].apply(lambda x: ' '.join([word for word in x.split() if word not in (toexclude)]))

But I only want to remove the words from the end of the name, i.e. "INC INTERNATIONAL LLC" should be "INC INTERNATIONAL". (The above makes it "INTERNATIONAL".) Any help would be much appreciated.

Edit: Following @ba_ul's suggestion below, I get an "unbalanced parenthesis" error:

for word in toexclude:
    data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))

Traceback (most recent call last):

  File "<ipython-input-139-c68049bc0f0d>", line 2, in <module>
    data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))

  File "/anaconda3/envs/pandas/lib/python3.7/site-packages/pandas/core/series.py", line 4042, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)

  File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer

  File "<ipython-input-139-c68049bc0f0d>", line 2, in <lambda>
    data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))

  File "/anaconda3/envs/pandas/lib/python3.7/re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)

  File "/anaconda3/envs/pandas/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)

  File "/anaconda3/envs/pandas/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)

  File "/anaconda3/envs/pandas/lib/python3.7/sre_parse.py", line 944, in parse
    raise source.error("unbalanced parenthesis")

error: unbalanced parenthesis

Upvotes: 0

Views: 2429

Answers (2)

ba_ul

Reputation: 2209

You can check each word for two conditions: (1) whether it's in toexclude and (2) whether it's the last word in the company name.

toexclude = dfwcomwords['ending'].tolist()

def remove_suffix(x):
    x_list = x.split()
    return ' '.join([word for index, word in enumerate(x_list) if not (word in toexclude and index == len(x_list) - 1)])

data['names'] = data['names'].apply(remove_suffix)
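With a few thousand suffixes and a few million names, the lookup `word in toexclude` is probably the expensive part when toexclude is a list. Here is a minimal sketch of the same last-word check using a set instead. The sample suffixes and names are made up for illustration, and `remove_last_suffix` is a hypothetical helper name, not something from the question.

```python
# Hypothetical sample suffixes; in the question they come from dfwcomwords['ending'].
toexclude = {"LLC", "INC", "CO"}  # a set makes each membership test O(1) on average


def remove_last_suffix(name):
    """Drop the final word of `name` if that word is in `toexclude`."""
    words = name.split()
    if words and words[-1] in toexclude:
        words = words[:-1]
    return " ".join(words)


# Made-up example names, only to show the behavior.
examples = ["INC INTERNATIONAL LLC", "ACME CO", "PLAIN NAME"]
cleaned = [remove_last_suffix(n) for n in examples]
print(cleaned)
```

On a DataFrame this should plug in the same way as above, via data['names'].apply(remove_last_suffix). Like the original, it only handles single-word suffixes and removes at most one of them.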

Edit: For suffixes containing spaces, you can remove them first by using regex and the str.replace function of pandas.

data['names'] = data['names'].str.replace(r'S\. A\. R\. L\.$', '', regex=True)

# If you have multiple such unusual suffixes, you can chain all of them together
data['names'] = data['names'].str.replace(r'S\. A\. R\. L\.$', '', regex=True).str.replace(r'L L C$', '', regex=True)

$ in the regex ensures you remove only the occurrences that are at the end of a name.
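As a small illustration of the `$` anchor, here is a sketch using plain `re.sub` on made-up names (the sample strings are assumptions, not data from the question). It also shows a related point: an unescaped `.` matches any character, so a literal dot needs to be written as `\.`.

```python
import re

pattern_escaped = r"S\. A\. R\. L\.$"   # dots are literal
pattern_unescaped = r"S. A. R. L.$"     # each '.' matches any single character

# The suffix in the middle of a name is left alone because of '$'.
middle = re.sub(pattern_escaped, "", "S. A. R. L. HOLDINGS")

# The suffix at the end is removed; rstrip() drops the space left behind.
end = re.sub(pattern_escaped, "", "ACME S. A. R. L.").rstrip()

# With unescaped dots, an unrelated ending of the same shape is removed too.
false_hit = re.sub(pattern_unescaped, "", "ACME SX AX RX LX")
kept = re.sub(pattern_escaped, "", "ACME SX AX RX LX")

print(repr(middle), repr(end), repr(false_hit), repr(kept))
```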

Edit #2: Based on the new comments, a pure regex solution would probably be best. It's just three lines and should cover all cases.

import re

for word in toexclude:
    data['names'] = data['names'].apply(lambda x: re.sub(r'\b{}$'.format(re.escape(word)), '', x, flags=re.IGNORECASE))
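The "unbalanced parenthesis" error in the question is what `re` raises when a pattern contains an unmatched `)`, so presumably at least one suffix in the list contains such a character; `re.escape` avoids that. Since the loop above makes one pass over the whole column per suffix, a possible alternative is to compile all suffixes into a single pattern and make one pass. This is only a sketch: the sample suffixes and names are made up, `strip_suffix` is a hypothetical helper, and it uses "start of string or whitespace" instead of `\b` so that suffixes beginning with a non-word character such as `(` can still match.

```python
import re

# Hypothetical sample suffixes, including some with regex metacharacters.
toexclude = ["LLC", "INC", "CO", "S. A. R. L.", "(PTY)"]

# One alternation of escaped suffixes, anchored at the end of the name.
# The suffix must be preceded by whitespace or be the whole name.
suffix_re = re.compile(
    r"(?:^|\s+)(?:" + "|".join(re.escape(w) for w in toexclude) + r")$",
    flags=re.IGNORECASE,
)


def strip_suffix(name):
    """Remove one trailing suffix (and the whitespace before it), if present."""
    return suffix_re.sub("", name)


# Made-up example names, only to show the behavior.
examples = ["INC INTERNATIONAL LLC", "Acme s. a. r. l.", "FOO (PTY)", "DISCO"]
cleaned = [strip_suffix(n) for n in examples]
print(cleaned)
```

On the DataFrame this would be used as data['names'].apply(strip_suffix). One difference from the loop: this removes at most one suffix per name, whereas the loop can remove several in succession depending on the order of toexclude.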

Upvotes: 1

Bruno Vermeulen

Reputation: 3465

Change the check as follows:

data['names'] = data['names'].apply(
    lambda x: ' '.join([word for i, word in enumerate(x.split()) if not (
        i == len(x.split()) - 1 and word in toexclude)]))

Upvotes: 1
