Reputation: 69
I have a few thousand common words such as LLC, INC, CO that I need to remove from the end of a few million company names in a pandas dataframe column. The following removes the common words in any position:
toexclude = dfwcomwords['ending'].tolist()
data['names'] = data['names'].apply(lambda x: ' '.join([word for word in x.split() if word not in (toexclude)]))
But I only want to remove the words from the end of the name, i.e. "INC INTERNATIONAL LLC" should be "INC INTERNATIONAL". (The above makes it "INTERNATIONAL".) Any help would be much appreciated.
Edit: Following @ba_ul's suggestion below, I get an "unbalanced parenthesis" error:
for word in toexclude:
    data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))
Traceback (most recent call last):
File "<ipython-input-139-c68049bc0f0d>", line 2, in <module>
data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))
File "/anaconda3/envs/pandas/lib/python3.7/site-packages/pandas/core/series.py", line 4042, in apply
mapped = lib.map_infer(values, f, convert=convert_dtype)
File "pandas/_libs/lib.pyx", line 2228, in pandas._libs.lib.map_infer
File "<ipython-input-139-c68049bc0f0d>", line 2, in <lambda>
data['names'] = data['names'].apply(lambda x: re.sub(rf'{word}$', '', x, flags=re.IGNORECASE))
File "/anaconda3/envs/pandas/lib/python3.7/re.py", line 192, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/anaconda3/envs/pandas/lib/python3.7/re.py", line 286, in _compile
p = sre_compile.compile(pattern, flags)
File "/anaconda3/envs/pandas/lib/python3.7/sre_compile.py", line 764, in compile
p = sre_parse.parse(p, flags)
File "/anaconda3/envs/pandas/lib/python3.7/sre_parse.py", line 944, in parse
raise source.error("unbalanced parenthesis")
error: unbalanced parenthesis
Upvotes: 0
Views: 2429
Reputation: 2209
You can check word for two conditions: (1) whether it's in toexclude and (2) whether it's the last word in the company name.
toexclude = dfwcomwords['ending'].tolist()
def remove_suffix(x):
    x_list = x.split()
    # Drop a word only if it is both in toexclude and the last word of the name
    return ' '.join([word for index, word in enumerate(x_list) if not (word in toexclude and index == len(x_list) - 1)])
data['names'] = data['names'].apply(remove_suffix)
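Since only the last word can ever be dropped, the same check can be sketched without scanning every word. This is a minimal illustration, not a tested replacement: the helper name `strip_last_word` and the sample suffixes are made up here, and the suffixes are held in a set on the assumption that membership tests against a few thousand entries benefit from it.

```python
def strip_last_word(name, suffixes):
    """Drop the final word of `name` if it is in `suffixes` (case-sensitive)."""
    words = name.split()
    if words and words[-1] in suffixes:
        words.pop()  # only the last word is ever removed
    return " ".join(words)

# Illustrative suffixes; in the question they would be set(toexclude).
suffixes = {"LLC", "INC", "CO"}
print(strip_last_word("INC INTERNATIONAL LLC", suffixes))  # INC INTERNATIONAL
```

It would be applied the same way as above, e.g. data['names'].apply(lambda x: strip_last_word(x, suffixes)).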
Edit: For suffixes containing spaces, you can remove them first using a regex with the pandas str.replace function.
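# Escape the dots so they match literally; regex=True is needed on newer pandas, where str.replace is literal by default
data['names'] = data['names'].str.replace(r'S\. A\. R\. L\.$', '', regex=True)
# If you have multiple such unusual suffixes, you can chain all of them together
data['names'] = data['names'].str.replace(r'S\. A\. R\. L\.$', '', regex=True).str.replace(r'L L C$', '', regex=True)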
The $ in the regex ensures that only occurrences at the end of a name are removed.
Edit #2: Based on the new comments, a pure regex solution would probably be best. It's just three lines and should cover all cases.
import re
for word in toexclude:
    data['names'] = data['names'].apply(lambda x: re.sub(r'\b{}$'.format(re.escape(word)), '', x, flags=re.IGNORECASE))
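With a few thousand suffixes and a few million names, looping over every suffix means one full pass over the column per suffix. One possible alternative, sketched below under assumptions, is to combine all suffixes into a single compiled pattern. The helper name `build_suffix_pattern` and the sample suffixes are hypothetical. `re.escape` is what keeps characters such as parentheses and dots in a suffix from being read as regex syntax. The sketch uses `(?<!\S)` instead of `\b` so that suffixes starting with a non-word character can still match, and it also strips the whitespace around the removed suffix.

```python
import re

def build_suffix_pattern(suffixes):
    """Compile one regex that matches any of `suffixes` at the end of a string."""
    # re.escape makes characters such as "(", ")" and "." match literally.
    alternatives = "|".join(re.escape(s) for s in suffixes if s)
    # (?<!\S) requires the suffix to start at the beginning of the string
    # or right after whitespace, so "ACMELLC" is left alone.
    return re.compile(r"\s*(?<!\S)(?:" + alternatives + r")\s*$", re.IGNORECASE)

# Illustrative suffixes; in the question they would come from toexclude.
pattern = build_suffix_pattern(["LLC", "INC", "CO", "S. A. R. L.", "(PTY)"])

for name in ["INC INTERNATIONAL LLC", "Foo S. A. R. L.", "Bar (PTY)", "ACMELLC"]:
    print(repr(pattern.sub("", name)))
```

On a dataframe this would presumably be applied in one pass, e.g. data['names'].str.replace(pattern, '', regex=True), assuming a pandas version whose str.replace accepts a compiled pattern.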
Upvotes: 1
Reputation: 3465
Change the check as follows:
data['names'] = data['names'].apply(
    lambda x: ' '.join([word for i, word in enumerate(x.split()) if not (
        i == len(x.split()) - 1 and word in toexclude)]))
Upvotes: 1