Slouei
Slouei

Reputation: 756

How to treat some rows as new columns using pandas

The input dataframe is in the form below:

df1 = pd.DataFrame(
    {
        "A": ["A0", "A0", "A0", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3])

I want to transform this table in a way that if there are multiple rows with similar values in column A, the extra rows become new columns. The expected table would be:

df2 = pd.DataFrame(
    {
        "A": ["A0",  "A3"],
        "B": ["B0",  "B3"],
        "C": ["C0",  "C3"],
        "D": ["D0",   "D3"],
        "new_B": ["B1",   "NaN"],
        "new_C": ["C1",   "NaN"],
        "new_D": ["D1", "NaN"],
        "new_B_2": ["B2",   "NaN"],
        "new_C_2": ["C2",   "NaN"],
        "new_D_2": ["D2","NaN"],
    },
    index=[0, 1])

A unique value in column A might be repeated at most 5 times. enter image description here

Upvotes: 3

Views: 88

Answers (3)

ggaurav
ggaurav

Reputation: 1804

If you want to use less pandas, you can go by this

def func(df, col_names):
    vals = df.values.ravel().tolist() # get all the values under the group
    vals += [np.nan] * (len(col_names)-len(vals)) # populate missing vals with nans
    return pd.Series(vals, index = col_names)

df1 = df1.set_index('A')

grp_by = df1.groupby(level=0)
max_grp_by_size = grp_by.size().max() # max group size for getting the new columns
col_names = df1.columns.values.tolist()
new_col_names = col_names + [f'new_{c}_{i}'
                             for i in range(1, max_grp_by_size) for c in col_names]

grp_by[col_names].apply(lambda x: func(x, new_col_names))

enter image description here

Upvotes: 0

Shubham Sharma
Shubham Sharma

Reputation: 71689

You can use .cumcount to create sequential counter per column A group, then set this counter along with column A as MultiIndex, followed by .stack + .unstack to reshape, finally flatten the columns using list comprehension:

df2 = df1.set_index([df1.groupby('A').cumcount(), 'A']).stack().unstack([-1, 0])
df2.columns = [x if y == 0 else f'{x}_{y}' for x, y in df2]

Simplified version of above code if order & formatting of columns does not matter:

df2 = df1.set_index([df1.groupby('A').cumcount().astype(str), 'A']).unstack(0)
df2.columns = df2.columns.map('_'.join)

     B   C   D  B_1  C_1  D_1  B_2  C_2  D_2
A                                           
A0  B0  C0  D0   B1   C1   D1   B2   C2   D2
A3  B3  C3  D3  NaN  NaN  NaN  NaN  NaN  NaN

Upvotes: 4

anky
anky

Reputation: 75080

This is pivot with naming logic for columns which can be handled with an if else in a list comprehension to match the original expected names.

out = (df1.set_index(['A',df1.groupby("A").cumcount()]).unstack()
          .sort_index(level=1,axis=1))

out.columns = [a if b ==0 else f"new_{a}" if b==1 else f"new_{a}_{b}" 
               for a,b in out.columns]
out = out.reset_index()

print(out)

    A   B   C   D new_B new_C new_D new_B_2 new_C_2 new_D_2
0  A0  B0  C0  D0    B1    C1    D1      B2      C2      D2
1  A3  B3  C3  D3   NaN   NaN   NaN     NaN     NaN     NaN

Upvotes: 4

Related Questions