Reputation: 10058
I have 2 datasets in CSV file, using pandas each file is converted into 2 different dataframes.
I want to find similar companies based on their url. I'm able to find similar companies based on 1 field (Rule1), but I want to compare more efficiently as following:
Dataset 1
uuid, company_name, website
YAHOO,Yahoo,yahoo.com
CSCO,Cisco,cisco.com
APPL,Apple,
Dataset 2
company_name, company_website, support_website, privacy_website
Yahoo,,yahoo.com,yahoo.com
Google,google.com,,
Cisco,,,cisco.com
Result Dataset
company_name, company_website, support_website, privacy_website, uuid
Yahoo,,yahoo.com,yahoo.com,YAHOO
Google,google.com,,
Cisco,,,cisco.com,CSCO
Rules
If field website in dataset 1 is the same as field company_website in dataset 2, extract identifier.
If not match, check if field website in dataset 1 is the same as field support_website in dataset 2, extract identifier.
If not match, check if field website in dataset 1 is the same as field privacy_website in dataset 2, extract identifier.
If not match, check if field company_name in dataset 1 is the same as field company_name in dataset 2, extract identifier.
If not matches return record and identifier field (UUID) will be empty.
Here is my current function:
def MatchCompanies(
companies: pandas.Dataframe,
competitor_companies: pandas.Dataframe) -> Optional[Sequence[str]]:
"""Find Competitor companies in companies dataframe and generate a new list.
Args:
companies: A dataframe with company information from CSV file.
competitor_companies: A dataframe with Competitor information from CSV file.
Returns:
A sequence of matched companies and their UUID.
Raises:
ValueError: No companies found.
"""
if _IsEmpty(companies):
raise ValueError('No companies found')
# Clean up empty fields. Use extra space to avoid matching on empty TLD.
companies.fillna({'website': ' '}, inplace=True)
competitor_companies = competitor_companies.fillna('')
logging.info('Found: %d records.', len(competitor_companies))
# Rename column to TLD to compare matching companies.
companies.rename(columns={'website': 'tld'}, inplace=True)
logging.info('Cleaning up company name.')
companies.company_name = companies.company_name.apply(_NormalizeText)
competitor_companies.company_name = competitor_companies.company_name.apply(
_NormalizeText)
# Rename column to TLD since Competitor already contains TLD in company_website.
competitor_companies.rename(columns={'company_website': 'tld'}, inplace=True)
logging.info('Extracting UUID')
merge_tld = competitor_companies.merge(
companies[['tld', 'uuid']], on='tld', how='left')
# Extracts UUID for company name matches.
competitor_companies = competitor_companies.merge(
companies[['company_name', 'uuid']], on='company_name', how='left')
# Combines dataframes.
competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
merge_tld['uuid'])
match_companies = len(
competitor_companies[competitor_companies['uuid'].notnull()])
total_companies = len(competitor_companies)
logging.info('Results found: %d out of %d', match_companies, total_companies)
competitor_companies.rename(columns={'tld': 'company_website'}, inplace=True)
return competitor_companies
Looking for advise in which function to use?
Upvotes: 2
Views: 603
Reputation: 862511
Use map
by Series
with combine_first
, but one requrement is necessary - always unique values in df1['website']
and df1['company_name']
:
df1 = df1.dropna()
s1 = df1.set_index('website')['uuid']
s2 = df1.set_index('company_name')['uuid']
w1 = df2['company_website'].map(s1)
w2 = df2['support_website'].map(s1)
w3 = df2['privacy_website'].map(s1)
c = df2['company_name'].map(s2)
df2['uuid'] = w1.combine_first(w2).combine_first(w3).combine_first(c)
print (df2)
company_name company_website support_website privacy_website uuid
0 Yahoo NaN yahoo.com yahoo.com YAHOO
1 Google google.com NaN NaN NaN
2 Cisco NaN NaN cisco.com CSCO
Upvotes: 2
Reputation: 862
Take a look at dataframe.merge. Rename third column in A to company_website
and do something like
A.merge(B, on='company_website', indicator=True)
should at least take care of the first rule.
Upvotes: -1