Reputation: 61
I am using Pandas to work on a dataframe that has a column with the names of companies.
Several versions of each company name appear. df is an example:
df = pd.DataFrame({'id':['a','b','c','d','e','f'],'company name':['name1', ' Name1 LTD', 'name1, ltd.','name 1 LT.D.',' name2 p.p.c', 'name2 ppc.']})
I was wondering if there is a simple way to find similar names and assign the same id to all of them. For the example above, I would like to get something like:
dg = pd.DataFrame({'id':['a','a','a','a','e','e'],'company name':['name1', ' Name1 LTD', 'name1, ltd.','name 1 LT.D.',' name2 p.p.c', 'name2 ppc.']})
Thanks,
Upvotes: 1
Views: 2652
Reputation: 854
I feel your problem, like any programming problem, needs to be broken down into smaller pieces. I will break it down step by step, as I understood it and as I would approach it.
Step 1. Clean (make uniform) your company name values. Data cleansing is worth the effort here, because everything downstream depends on the names being consistent.
Step 2. Map your id based on the unique company names (this step is easy once Step 1 is done).
import pandas as pd
df = pd.DataFrame({'id':['a','b','c','d','e','f'],'company_name':['name1', ' Name1 LTD', 'name1, ltd.','name 1 LT.D.',' name2 p.p.c', 'name2 ppc.']})
Step 1.
Clean it by using extract with a regex. Keep in mind that the regex below only covers the small sample you provided; you will likely need to adapt the pattern to your full dataset.
df['new_company_name'] = (df['company_name']
    .str.lower()                              # lowercase to standardize
    .str.replace(' |,', '', regex=True)       # remove spaces and commas; adjust for your full dataset
    .str.extract(r'^(\w+\d)', expand=False))  # extract the vital part of the name; this will also vary with your data
print(df)
id company_name new_company_name
0 a name1 name1
1 b Name1 LTD name1
2 c name1, ltd. name1
3 d name 1 LT.D. name1
4 e name2 p.p.c name2
5 f name2 ppc. name2
Step 2
It is advisable to use numerical ids rather than strings, for performance.
Option 1: using groupby() with ngroup()
df['new_id'] = df.groupby('new_company_name').ngroup()
Option 2: using zip(), dict(), and then map()
unique_names = df['new_company_name'].unique()
mapper = dict(zip(unique_names, range(len(unique_names))))
df['new_id'] = df['new_company_name'].map(mapper)
Same result for Option 1 or Option 2
print(df)
id company_name new_company_name new_id
0 a name1 name1 0
1 b Name1 LTD name1 0
2 c name1, ltd. name1 0
3 d name 1 LT.D. name1 0
4 e name2 p.p.c name2 1
5 f name2 ppc. name2 1
Hope this helps.
Upvotes: 1
Reputation: 11
One thing I've done is use a regex, or a function that processes the raw strings, to strip out extras like "ltd" and arbitrary special characters. Then create a processed string that is each company's "true name", and build an index of ids based on that "true name".
Or you can use fuzzywuzzy to compute the distance between two strings, build a candidate set of matches, and derive an index of unique names from the match scores.
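To sketch the score-based matching idea without an extra dependency: fuzzywuzzy's ratio is built on the standard library's difflib.SequenceMatcher, so the same approach can be shown with difflib alone. This is only an illustration; the normalization step and the 70 threshold are arbitrary choices tuned to the sample data, not fixed parts of the method.

```python
import re
from difflib import SequenceMatcher

def similarity(a, b):
    """0-100 score after lowercasing and stripping non-alphanumerics
    (roughly what fuzzywuzzy's fuzz.ratio gives on cleaned input)."""
    a = re.sub(r'\W+', '', a.lower())
    b = re.sub(r'\W+', '', b.lower())
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))

names = ['name1', ' Name1 LTD', 'name1, ltd.',
         'name 1 LT.D.', ' name2 p.p.c', 'name2 ppc.']

# Greedy clustering: each name joins the first cluster whose
# representative scores above the threshold, else starts a new one.
clusters = []  # list of (representative, members)
for name in names:
    for rep, members in clusters:
        if similarity(rep, name) >= 70:  # threshold is a tunable assumption
            members.append(name)
            break
    else:
        clusters.append((name, [name]))

for rep, members in clusters:
    print(rep, members)
```

On the sample this yields two clusters, one per company; real data usually needs a more careful threshold and a better representative than "first name seen".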
e.g.
def clean_str(x):
    x2 = x.lower()
    x2 = x2.replace('.', '')
    return x2
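The "index of ids based on the true name" step can be sketched with pandas.factorize. The cleaning regex below is an assumption tuned only to the sample data (it lowercases, drops punctuation and spaces, and strips the trailing legal suffixes that happen to appear in the example):

```python
import re
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e', 'f'],
                   'company name': ['name1', ' Name1 LTD', 'name1, ltd.',
                                    'name 1 LT.D.', ' name2 p.p.c', 'name2 ppc.']})

def clean_str(x):
    """Reduce a raw company name to a 'true name' (sample-specific sketch)."""
    x = re.sub(r'[^a-z0-9]', '', x.lower())  # lowercase, drop punctuation/spaces
    x = re.sub(r'(ltd|ppc)$', '', x)         # strip trailing legal suffixes seen in the sample
    return x

true_names = df['company name'].map(clean_str)
df['group_id'] = pd.factorize(true_names)[0]  # numeric id per unique true name
print(df)
```

factorize assigns 0, 0, 0, 0, 1, 1 here, giving each group of variants a shared numeric id, as in the question's desired output.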
Upvotes: 1