Reputation: 45
I have a dataframe that has only one column 'Desc' and it looks like this.
Desc |
---|
AB - 01 |
123 AB NEXT - 01 |
010 EMPLOYEE - 23 |
020 DEMAND 80 |
010 EMPLOYEE 45 |
020 DEMAND 28 |
AAAAA............ |
BBBBB............. |
AB - 02 |
123 AB NEXT - 02 |
010 EMPLOYEE - 48 |
020 DEMAND - 87 |
010 EMPLOYEE - 94 |
020 DEMAND - 09 |
050 EMPLOYEE - 88 |
060 DEMAND - 90 |
BBBBBB.......... |
GGGGGG.......... |
I want to manipulate data in a way that produces the following output.
Col 1 | Col 2 | Col 3 | Col 4 | Col 5 | Col 6 |
---|---|---|---|---|---|
AB - 01 | 123 AB NEXT -01 | 010 EMPLOYEE 23 | 020 DEMAND 80 | NULL | NULL |
AB - 01 | 123 AB NEXT -01 | 010 EMPLOYEE 45 | 020 DEMAND 28 | NULL | NULL |
AB - 02 | 123 AB NEXT -02 | 010 EMPLOYEE 48 | 020 DEMAND 87 | 050 EMPLOYEE 88 | 060 DEMAND - 90 |
AB - 02 | 123 AB NEXT -02 | 010 EMPLOYEE 94 | 020 DEMAND 09 | NULL | NULL |
So basically, each AB is a broad category which has columns 010, 020, and so on. I was thinking of using this approach. The code looks for row staring with AB, create a column for that and that parse the data under it by placing each 010, 020 (all such numbers) in it's separate column until it encounters the next AB. Moreover, this is just an extract of the dataframe as it goes on with different ABs.
Upvotes: 0
Views: 82
Reputation: 23146
Use numpy.array_split
to split the DataFrame on new "AB" rows and for each frame in the resulting list:
groupby
the column number and cumcount
to get a unique index (row count)pivot
the dataappend
to output
:import numpy as np
groups = np.array_split(df, df[df["Desc"].str.contains("AB - \d{2}", regex=True)].index)
output = pd.DataFrame()
for group in groups:
srs = group.squeeze()
if srs.shape[0]==0:
continue
#first two rows are copied for all records
split_cols = srs[2:].str.split(n=1, expand=True)
split_cols[1] = srs.where(split_cols[0].str.isnumeric())
split_cols = split_cols.dropna()
split_cols["idx"] = split_cols.groupby(0)[1].transform("cumcount")
temp = split_cols.pivot("idx", 0, 1)
temp.insert(0, "Col 1", srs.iat[0])
temp.insert(1, "Col 2", srs.iat[0])
output = output.append(temp, ignore_index=True)
#rename columns if needed
output.columns = [f"Col {i+1}" for i in range(len(output.columns))]
>>> output
Col 1 Col 2 Col 3 Col 4 Col 5 \
0 AB - 01 AB - 01 010 EMPLOYEE - 23 020 DEMAND 80 NaN
1 AB - 01 AB - 01 010 EMPLOYEE 45 020 DEMAND 28 NaN
2 AB - 02 AB - 02 010 EMPLOYEE - 48 020 DEMAND - 87 050 EMPLOYEE - 88
3 AB - 02 AB - 02 010 EMPLOYEE - 94 020 DEMAND - 09 NaN
Col 6
0 NaN
1 NaN
2 060 DEMAND - 90
3 NaN
Upvotes: 1