ABC
ABC

Reputation: 45

Data Manipulation in pandas - Creating multiple columns based on multiple conditions

I have a dataframe that has only one column 'Desc' and it looks like this.

Desc
AB - 01
123 AB NEXT - 01
010 EMPLOYEE - 23
020 DEMAND 80
010 EMPLOYEE 45
020 DEMAND 28
AAAAA............
BBBBB.............
AB - 02
123 AB NEXT - 02
010 EMPLOYEE - 48
020 DEMAND - 87
010 EMPLOYEE - 94
020 DEMAND - 09
050 EMPLOYEE - 88
060 DEMAND - 90
BBBBBB..........
GGGGGG..........

I want to manipulate data in a way that produces the following output.

Col 1 Col 2 Col 3 Col 4 Col 5 Col 6
AB - 01 123 AB NEXT -01 010 EMPLOYEE 23 020 DEMAND 80 NULL NULL
AB - 01 123 AB NEXT -01 010 EMPLOYEE 45 020 DEMAND 28 NULL NULL
AB - 02 123 AB NEXT -02 010 EMPLOYEE 48 020 DEMAND 87 050 EMPLOYEE 88 060 DEMAND - 90
AB - 02 123 AB NEXT -02 010 EMPLOYEE 94 020 DEMAND 09 NULL NULL

So basically, each AB is a broad category which has columns 010, 020, and so on. I was thinking of using this approach. The code looks for row staring with AB, create a column for that and that parse the data under it by placing each 010, 020 (all such numbers) in it's separate column until it encounters the next AB. Moreover, this is just an extract of the dataframe as it goes on with different ABs.

Upvotes: 0

Views: 82

Answers (1)

not_speshal
not_speshal

Reputation: 23146

Use numpy.array_split to split the DataFrame on new "AB" rows and for each frame in the resulting list:

  1. split the data at the first space to get the column number (010, 020, etc.)
  2. groupby the column number and cumcount to get a unique index (row count)
  3. pivot the data
  4. append to output:
import numpy as np

groups = np.array_split(df, df[df["Desc"].str.contains("AB - \d{2}", regex=True)].index)

output = pd.DataFrame()
for group in groups:
    srs = group.squeeze()
    if srs.shape[0]==0:
        continue
    #first two rows are copied for all records
    split_cols = srs[2:].str.split(n=1, expand=True)
    split_cols[1] = srs.where(split_cols[0].str.isnumeric())
    split_cols = split_cols.dropna()
    split_cols["idx"] = split_cols.groupby(0)[1].transform("cumcount")
    
    temp = split_cols.pivot("idx", 0, 1)
    temp.insert(0, "Col 1", srs.iat[0])
    temp.insert(1, "Col 2", srs.iat[0])
    output = output.append(temp, ignore_index=True)

#rename columns if needed
output.columns = [f"Col {i+1}" for i in range(len(output.columns))]

>>> output

     Col 1    Col 2              Col 3            Col 4              Col 5  \
0  AB - 01  AB - 01  010 EMPLOYEE - 23    020 DEMAND 80                NaN   
1  AB - 01  AB - 01    010 EMPLOYEE 45    020 DEMAND 28                NaN   
2  AB - 02  AB - 02  010 EMPLOYEE - 48  020 DEMAND - 87  050 EMPLOYEE - 88   
3  AB - 02  AB - 02  010 EMPLOYEE - 94  020 DEMAND - 09                NaN   

             Col 6  
0              NaN  
1              NaN  
2  060 DEMAND - 90  
3              NaN  

Upvotes: 1

Related Questions