Reputation: 2948
In the example below I need to populate the 'Parent' column as follows: All of the column values would be CISCO except for rows 0 and 7 (should be left blank).
Note that 'CISCO' "is in" the cell below it 'CISCO System' which "is in" the cell below it 'CISCO Systems' etc. in fact..all of the CISCOs start with 'CISCO' so I need to group all of the cells that have the same start together as one entity and label the parent with the starting cell (CISCO).
We have multiple names for the same vendor so I'm trying to map all of those child 'CISCOs' to one parent 'CISCO'
Please note that I have 100,000 rows so the algorithm must be done automatically without manual intervention (i.e. not simply by hard coding parents = 'CISCO')
df = pd.DataFrame(['MICROSOFT','CISCO', 'CISCO System', 'CISCO Systems', 'CISCO Systems CANADA', 'CISCO Systems CANADA Corporation', 'CISCO Systems CANADA Corporation Limited', 'IBM'], columns=['Child']) #,[]], columns=['Child', 'Parent'])
df['Parent'] = ''
df
I was hoping that there's an elegant solution, preferably without needing loops. Many thanks for your help!
Required output:
Upvotes: 1
Views: 244
Reputation: 2948
For future reference, I've managed to come out with a very elegant and simple solution that words exactly as I want:
import pandas as pd
df = pd.DataFrame(['MICROSOFT', 'MICROSOFT CORP','CISCO', 'CISCO System', 'CISCO Systems', 'CISCO Systems CANADA', 'CISCO Systems CANADA Corporation', 'CISCO Systems CANADA Corporation Limited', 'IBM', 'Apple','Apple Corp'], columns=['Child'])
df['Parent'] = ''
c, p = df['Child'], df['Parent']
for y in range (1, df.shape[0]):
if c.iat[y-1] in c.iat[y]: p.iat[y] = np.nan #identify children
if str(p.iat[y]) == 'nan' and p.iat[y-1] == '' : p.iat[y-1] = c.iat[y-1] #identify parent
df['Parent'] = df['Parent'].ffill(axis = 0) #fill children
display(df)
Output:
Upvotes: 0
Reputation: 26676
This is a curly one. My attempt again;
Data
df = pd.DataFrame({'Child':['CANADA MOTOR','CANADA COMPUTERS', 'CANADA COMPUTERS CORPORATION', 'CANADA COMPUTERS CORPORATION LTD', 'CANADA SUPPLIES', 'CANADA SUPPLIES CORPORATION', 'CANADA SUPPLIES CORPORATION LTD', 'IBM','MICROSOFT','CISCO', 'CISCO System', 'CISCO Systems', 'CISCO Systems CANADA', 'CISCO Systems CANADA Corporation', 'CISCO Systems CANADA Corporation Limited', 'IBM']})
Extract first name for each Child into FirstCompanyName
df['FirstCompanyName']=df.Child.str.extract('(^\w+)')
Extract First and Second Names for each child into df2
, drop those without second name and rename columns to Child
and SeconCompanyName
df2=df.Child.str.extract('(^((?:\S+\s+){1}\S+).*)', expand=True).dropna()
df2.columns=['Child','SeconCompanyName']
Merge the 2 dataframes, replace any NaNs
and drop unwanted columns
df3= pd.merge(df, df2, left_index=True, right_index=True, how='left',suffixes=('', '_New'))
#df3.fillna('', inplace=True)#
df3.drop(columns=['Child_New'], inplace=True)
df3
mask where SeconCompanyName
is null
m=df3.SeconCompanyName.isna()
Replace SeconCompanyName
with FirstCompanyName
while the mask is still on
df3.loc[m,'SeconCompanyName']=df3.loc[m,'FirstCompanyName']
df3
Outcome 1
If you dont like the above skip the mask and do the following;
df3['SeconCompanyName']=np.where(df3.SeconCompanyName.isna(), df3.shift(-1).SeconCompanyName, df3.SeconCompanyName)
df3.fillna('', inplace=True)
df3
Outcome 2
Upvotes: 1
Reputation: 23099
You could split each Child
column by its white space and take the top n occurances as your pattern to extract using str.extractall
Naturally, you'll need to tweak the logic to your use-case.
s = df['Child'].str.split(' ',expand=True).stack().value_counts()
pat = '|'.join(s[s.gt(2)].index)
print(pat)
#'CISCO|Systems|CANADA'
df['Parent?'] = df['Child'].str.extractall(f'({pat})').groupby(level=0).agg(','.join)
print(df)
Child Parent Parent?
0 MICROSOFT NaN
1 CISCO CISCO
2 CISCO System CISCO
3 CISCO Systems CISCO,Systems
4 CISCO Systems CANADA CISCO,Systems,CANADA
5 CISCO Systems CANADA Corporation CISCO,Systems,CANADA
6 CISCO Systems CANADA Corporation Limited CISCO,Systems,CANADA
7 IBM NaN
Upvotes: 1