Reputation: 1093
I have a data frame df with a column name - Company. Few examples of the company names are: ABC Inc., XYZ Gmbh, PQR Ltd, JKL Limited etc. I want a list of all the suffixes (Inc.,Gmbh, Ltd., Limited etc). Please notice that suffix length is always different. There might be companies without any suffix, for example: Apple. I need a complete list of all suffixes from the all the company names, keeping only unique suffixes in the list.
How do I accomplish this task?
Upvotes: 2
Views: 3173
Reputation: 17319
You can use cleanco Python library for that, it has a list of all possible suffixes inside. E.g. it contains all the examples you provided (Inc, Gmbh, Ltd, Limited).
So you can take the suffixes from the library and use them as a dictionary to search in your data, e.g.:
import pandas as pd
company_names = pd.Series(["Apple", "ABS LLC", "Animusoft Corp", "A GMBH"])
suffixes = ["llc", "corp", "abc"] # take from cleanco source code
found = [any(company_names.map(lambda x: x.lower().endswith(' ' + suffix))) for suffix in suffixes]
suffixes_found = [suffix for (suffix, suffix_found) in zip(suffixes, found) if suffix_found]
print suffixes_found # outputs ['llc', 'corp']
Upvotes: 2
Reputation: 1465
This only adds the suffixes when the company name has more than one word as you required.
company_names = ["Apple", "ABS LLC", "Animusoft Corp"]
suffixes = [name.split()[-1] for name in company_names if len(name.split()) > 1]
Now having into account that this doesn't cover the unique requirement. This doesn't cover that you can have a company named like "Be Smart" and "Smart" is not a suffix but part of the name. However this takes care of the unique requirement:
company_names = ["Apple", "ABS LLC", "Animusoft Corp", "BBC Corp"]
suffixes = []
for name in company_names:
if len(name.split()) > 1 and name.split()[-1] not in suffixes:
suffixes.append(name.split()[-1])
Upvotes: 0
Reputation: 210812
try this:
In [36]: df
Out[36]:
Company
0 Google
1 Apple Inc
2 Microsoft Inc
3 ABC Inc.
4 XYZ Gmbh
5 PQR Ltd
6 JKL Limited
In [37]: df.Company.str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
Out[37]: array(['Inc', 'Inc.', 'Gmbh', 'Ltd', 'Limited'], dtype=object)
or ignoring punctuation:
In [38]: import string
In [39]: df.Company.str.replace('['+string.punctuation+']+','')
Out[39]:
0 Google
1 Apple Inc
2 Microsoft Inc
3 ABC Inc
4 XYZ Gmbh
5 PQR Ltd
6 JKL Limited
Name: Company, dtype: object
In [40]: df.Company.str.replace('['+string.punctuation+']+','').str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
Out[40]: array(['Inc', 'Gmbh', 'Ltd', 'Limited'], dtype=object)
export result into Excel file:
data = df.Company.str.replace('['+string.punctuation+']+','').str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
res = pd.DataFrame(data, columns=['Comp_suffix'])
res.to_excel(r'/path/to/file.xlsx', index=False)
Upvotes: 2
Reputation: 60944
So you want the last word of the Company name, assuming the company has a name more than one word long?
set(name_list[-1] for name_list in map(str.split, company_names) if len(name_list) > 1)
The [-1]
gets the last word. str.split
splits on spaces. I've never used pandas, so getting company_names
might be the hard part of this.
Upvotes: 0