Chadee Fouad
Chadee Fouad

Reputation: 2948

Pandas: Fill Cells Down Algorithm

In the example below I need to populate the 'Parent' column as follows: All of the column values would be CISCO except for rows 0 and 7 (should be left blank).

Note that 'CISCO' "is in" the cell below it 'CISCO System' which "is in" the cell below it 'CISCO Systems' etc. in fact..all of the CISCOs start with 'CISCO' so I need to group all of the cells that have the same start together as one entity and label the parent with the starting cell (CISCO).

We have multiple names for the same vendor so I'm trying to map all of those child 'CISCOs' to one parent 'CISCO'

Please note that I have 100,000 rows so the algorithm must be done automatically without manual intervention (i.e. not simply by hard coding parents = 'CISCO')

df = pd.DataFrame(['MICROSOFT','CISCO', 'CISCO System', 'CISCO Systems', 'CISCO Systems CANADA', 'CISCO Systems CANADA Corporation', 'CISCO Systems CANADA Corporation Limited', 'IBM'], columns=['Child']) #,[]], columns=['Child', 'Parent'])
df['Parent'] = ''
df

I was hoping that there's an elegant solution, preferably without needing loops. Many thanks for your help!

enter image description here

Required output:

enter image description here

Upvotes: 1

Views: 244

Answers (3)

Chadee Fouad
Chadee Fouad

Reputation: 2948

For future reference, I've managed to come out with a very elegant and simple solution that words exactly as I want:

    import pandas as pd
df = pd.DataFrame(['MICROSOFT', 'MICROSOFT CORP','CISCO', 'CISCO System', 'CISCO Systems', 'CISCO Systems CANADA', 'CISCO Systems CANADA Corporation', 'CISCO Systems CANADA Corporation Limited', 'IBM', 'Apple','Apple Corp'], columns=['Child'])
df['Parent'] = ''

c, p = df['Child'], df['Parent']
for y in range (1, df.shape[0]): 
    if c.iat[y-1] in c.iat[y]: p.iat[y] = np.nan #identify children
    if str(p.iat[y]) == 'nan' and p.iat[y-1] == '' : p.iat[y-1] = c.iat[y-1] #identify parent

df['Parent'] = df['Parent'].ffill(axis = 0) #fill children
display(df)

Output:

enter image description here

Upvotes: 0

wwnde
wwnde

Reputation: 26676

This is a curly one. My attempt again;

Data

df = pd.DataFrame({'Child':['CANADA MOTOR','CANADA COMPUTERS', 'CANADA COMPUTERS CORPORATION', 'CANADA COMPUTERS CORPORATION LTD', 'CANADA SUPPLIES', 'CANADA SUPPLIES CORPORATION', 'CANADA SUPPLIES CORPORATION LTD', 'IBM','MICROSOFT','CISCO', 'CISCO System', 'CISCO Systems', 'CISCO Systems CANADA', 'CISCO Systems CANADA Corporation', 'CISCO Systems CANADA Corporation Limited', 'IBM']})

Extract first name for each Child into FirstCompanyName

df['FirstCompanyName']=df.Child.str.extract('(^\w+)')

Extract First and Second Names for each child into df2, drop those without second name and rename columns to Child and SeconCompanyName

df2=df.Child.str.extract('(^((?:\S+\s+){1}\S+).*)', expand=True).dropna()
df2.columns=['Child','SeconCompanyName']

Merge the 2 dataframes, replace any NaNs and drop unwanted columns

   df3= pd.merge(df, df2, left_index=True, right_index=True, how='left',suffixes=('', '_New'))
#df3.fillna('', inplace=True)#
df3.drop(columns=['Child_New'], inplace=True)
df3

mask where SeconCompanyName is null

m=df3.SeconCompanyName.isna()

Replace SeconCompanyName with FirstCompanyName while the mask is still on

df3.loc[m,'SeconCompanyName']=df3.loc[m,'FirstCompanyName']
df3

Outcome 1

enter image description here

If you dont like the above skip the mask and do the following;

df3['SeconCompanyName']=np.where(df3.SeconCompanyName.isna(), df3.shift(-1).SeconCompanyName, df3.SeconCompanyName)
df3.fillna('', inplace=True)
df3

Outcome 2

enter image description here

Upvotes: 1

Umar.H
Umar.H

Reputation: 23099

You could split each Child column by its white space and take the top n occurances as your pattern to extract using str.extractall

Naturally, you'll need to tweak the logic to your use-case.

s = df['Child'].str.split(' ',expand=True).stack().value_counts()

pat = '|'.join(s[s.gt(2)].index)

print(pat)
#'CISCO|Systems|CANADA'

df['Parent?'] = df['Child'].str.extractall(f'({pat})').groupby(level=0).agg(','.join)

print(df)

                                      Child Parent               Parent?
0                                 MICROSOFT                          NaN
1                                     CISCO                        CISCO
2                              CISCO System                        CISCO
3                             CISCO Systems                CISCO,Systems
4                      CISCO Systems CANADA         CISCO,Systems,CANADA
5          CISCO Systems CANADA Corporation         CISCO,Systems,CANADA
6  CISCO Systems CANADA Corporation Limited         CISCO,Systems,CANADA
7                                       IBM                          NaN

Upvotes: 1

Related Questions