ComplexData
ComplexData

Reputation: 1093

Getting a list of suffixes from the company names

I have a data frame df with a column name - Company. Few examples of the company names are: ABC Inc., XYZ Gmbh, PQR Ltd, JKL Limited etc. I want a list of all the suffixes (Inc.,Gmbh, Ltd., Limited etc). Please notice that suffix length is always different. There might be companies without any suffix, for example: Apple. I need a complete list of all suffixes from the all the company names, keeping only unique suffixes in the list.

How do I accomplish this task?

Upvotes: 2

Views: 3173

Answers (4)

Dennis Golomazov
Dennis Golomazov

Reputation: 17319

You can use cleanco Python library for that, it has a list of all possible suffixes inside. E.g. it contains all the examples you provided (Inc, Gmbh, Ltd, Limited).

So you can take the suffixes from the library and use them as a dictionary to search in your data, e.g.:

import pandas as pd
company_names = pd.Series(["Apple", "ABS LLC", "Animusoft Corp", "A GMBH"])
suffixes = ["llc", "corp", "abc"]  # take from cleanco source code
found = [any(company_names.map(lambda x: x.lower().endswith(' ' + suffix))) for suffix in suffixes]
suffixes_found = [suffix for (suffix, suffix_found) in zip(suffixes, found) if suffix_found]
print suffixes_found  # outputs ['llc', 'corp']

Upvotes: 2

epinal
epinal

Reputation: 1465

This only adds the suffixes when the company name has more than one word as you required.

company_names = ["Apple", "ABS LLC", "Animusoft Corp"]
suffixes = [name.split()[-1] for name in company_names if len(name.split()) > 1]

Now having into account that this doesn't cover the unique requirement. This doesn't cover that you can have a company named like "Be Smart" and "Smart" is not a suffix but part of the name. However this takes care of the unique requirement:

company_names = ["Apple", "ABS LLC", "Animusoft Corp", "BBC Corp"]
suffixes = []
for name in company_names:
    if len(name.split()) > 1 and name.split()[-1] not in suffixes:      
        suffixes.append(name.split()[-1])

Upvotes: 0

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210812

try this:

In [36]: df
Out[36]:
         Company
0         Google
1      Apple Inc
2  Microsoft Inc
3       ABC Inc.
4       XYZ Gmbh
5        PQR Ltd
6    JKL Limited

In [37]: df.Company.str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
Out[37]: array(['Inc', 'Inc.', 'Gmbh', 'Ltd', 'Limited'], dtype=object)

or ignoring punctuation:

In [38]: import string

In [39]: df.Company.str.replace('['+string.punctuation+']+','')
Out[39]:
0           Google
1        Apple Inc
2    Microsoft Inc
3          ABC Inc
4         XYZ Gmbh
5          PQR Ltd
6      JKL Limited
Name: Company, dtype: object

In [40]: df.Company.str.replace('['+string.punctuation+']+','').str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
Out[40]: array(['Inc', 'Gmbh', 'Ltd', 'Limited'], dtype=object)

export result into Excel file:

data = df.Company.str.replace('['+string.punctuation+']+','').str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
res = pd.DataFrame(data, columns=['Comp_suffix'])
res.to_excel(r'/path/to/file.xlsx', index=False)

Upvotes: 2

Patrick Haugh
Patrick Haugh

Reputation: 60944

So you want the last word of the Company name, assuming the company has a name more than one word long?

set(name_list[-1] for name_list in map(str.split, company_names) if len(name_list) > 1)

The [-1] gets the last word. str.split splits on spaces. I've never used pandas, so getting company_names might be the hard part of this.

Upvotes: 0

Related Questions