Reputation: 756
The input dataframe is in the form below:
df1 = pd.DataFrame(
{
"A": ["A0", "A0", "A0", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3])
I want to transform this table in a way that if there are multiple rows with similar values in column A, the extra rows become new columns. The expected table would be:
df2 = pd.DataFrame(
{
"A": ["A0", "A3"],
"B": ["B0", "B3"],
"C": ["C0", "C3"],
"D": ["D0", "D3"],
"new_B": ["B1", "NaN"],
"new_C": ["C1", "NaN"],
"new_D": ["D1", "NaN"],
"new_B_2": ["B2", "NaN"],
"new_C_2": ["C2", "NaN"],
"new_D_2": ["D2","NaN"],
},
index=[0, 1])
A unique value in column A might be repeated at most 5 times.
Upvotes: 3
Views: 88
Reputation: 1804
If you want to use less pandas, you can go by this
def func(df, col_names):
vals = df.values.ravel().tolist() # get all the values under the group
vals += [np.nan] * (len(col_names)-len(vals)) # populate missing vals with nans
return pd.Series(vals, index = col_names)
df1 = df1.set_index('A')
grp_by = df1.groupby(level=0)
max_grp_by_size = grp_by.size().max() # max group size for getting the new columns
col_names = df1.columns.values.tolist()
new_col_names = col_names + [f'new_{c}_{i}'
for i in range(1, max_grp_by_size) for c in col_names]
grp_by[col_names].apply(lambda x: func(x, new_col_names))
Upvotes: 0
Reputation: 71689
You can use .cumcount
to create sequential counter per column A
group, then set this counter along with column A
as MultiIndex, followed by .stack
+ .unstack
to reshape
, finally flatten the columns using list comprehension:
df2 = df1.set_index([df1.groupby('A').cumcount(), 'A']).stack().unstack([-1, 0])
df2.columns = [x if y == 0 else f'{x}_{y}' for x, y in df2]
Simplified version of above code if order & formatting of columns does not matter:
df2 = df1.set_index([df1.groupby('A').cumcount().astype(str), 'A']).unstack(0)
df2.columns = df2.columns.map('_'.join)
B C D B_1 C_1 D_1 B_2 C_2 D_2
A
A0 B0 C0 D0 B1 C1 D1 B2 C2 D2
A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN
Upvotes: 4
Reputation: 75080
This is pivot with naming logic for columns which can be handled with an if else in a list comprehension to match the original expected names.
out = (df1.set_index(['A',df1.groupby("A").cumcount()]).unstack()
.sort_index(level=1,axis=1))
out.columns = [a if b ==0 else f"new_{a}" if b==1 else f"new_{a}_{b}"
for a,b in out.columns]
out = out.reset_index()
print(out)
A B C D new_B new_C new_D new_B_2 new_C_2 new_D_2
0 A0 B0 C0 D0 B1 C1 D1 B2 C2 D2
1 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN
Upvotes: 4