Data Manipulation in pandas - Creating multiple columns based on multiple conditions

Question

I have a dataframe that has only one column 'Desc' and it looks like this.

Desc
AB - 01
123 AB NEXT - 01
010 EMPLOYEE - 23
020 DEMAND 80
010 EMPLOYEE 45
020 DEMAND 28
AAAAA............
BBBBB.............
AB - 02
123 AB NEXT - 02
010 EMPLOYEE - 48
020 DEMAND - 87
010 EMPLOYEE - 94
020 DEMAND - 09
050 EMPLOYEE - 88
060 DEMAND - 90
BBBBBB..........
GGGGGG..........

I want to manipulate data in a way that produces the following output.

Col 1	Col 2	Col 3	Col 4	Col 5	Col 6
AB - 01	123 AB NEXT -01	010 EMPLOYEE 23	020 DEMAND 80	NULL	NULL
AB - 01	123 AB NEXT -01	010 EMPLOYEE 45	020 DEMAND 28	NULL	NULL
AB - 02	123 AB NEXT -02	010 EMPLOYEE 48	020 DEMAND 87	050 EMPLOYEE 88	060 DEMAND - 90
AB - 02	123 AB NEXT -02	010 EMPLOYEE 94	020 DEMAND 09	NULL	NULL

So basically, each AB is a broad category which has columns 010, 020, and so on. I was thinking of using this approach. The code looks for row staring with AB, create a column for that and that parse the data under it by placing each 010, 020 (all such numbers) in it's separate column until it encounters the next AB. Moreover, this is just an extract of the dataframe as it goes on with different ABs.

not_speshal · Accepted Answer

Use numpy.array_split to split the DataFrame on new "AB" rows and for each frame in the resulting list:

split the data at the first space to get the column number (010, 020, etc.)
groupby the column number and cumcount to get a unique index (row count)
pivot the data
append to output:

import numpy as np

groups = np.array_split(df, df[df["Desc"].str.contains("AB - \d{2}", regex=True)].index)

output = pd.DataFrame()
for group in groups:
    srs = group.squeeze()
    if srs.shape[0]==0:
        continue
    #first two rows are copied for all records
    split_cols = srs[2:].str.split(n=1, expand=True)
    split_cols[1] = srs.where(split_cols[0].str.isnumeric())
    split_cols = split_cols.dropna()
    split_cols["idx"] = split_cols.groupby(0)[1].transform("cumcount")
    
    temp = split_cols.pivot("idx", 0, 1)
    temp.insert(0, "Col 1", srs.iat[0])
    temp.insert(1, "Col 2", srs.iat[0])
    output = output.append(temp, ignore_index=True)

#rename columns if needed
output.columns = [f"Col {i+1}" for i in range(len(output.columns))]

>>> output

     Col 1    Col 2              Col 3            Col 4              Col 5  \
0  AB - 01  AB - 01  010 EMPLOYEE - 23    020 DEMAND 80                NaN   
1  AB - 01  AB - 01    010 EMPLOYEE 45    020 DEMAND 28                NaN   
2  AB - 02  AB - 02  010 EMPLOYEE - 48  020 DEMAND - 87  050 EMPLOYEE - 88   
3  AB - 02  AB - 02  010 EMPLOYEE - 94  020 DEMAND - 09                NaN   

             Col 6  
0              NaN  
1              NaN  
2  060 DEMAND - 90  
3              NaN

Data Manipulation in pandas - Creating multiple columns based on multiple conditions

Answers (1)

Related Questions