maln0ir
maln0ir

Reputation: 589

How to "partially transpose" dataframe in Pandas?

I have csv file like this:

A,B,C,X
a,a,a,1.0
a,a,a,2.1
a,b,b,1.2
a,b,b,2.4
a,b,b,3.6
b,c,c,1.1
b,c,d,1.0

(A, B, C) is a "primary key" in this dataset, that means this set of columns should be unique. What I need to do is to find duplicates and present associated values (X column) in separate columns, like this:

A,B,C,X1,X2,X3
a,a,a,1.0,2.1,
a,b,b,1.2,2.4,3.6

I somehow know how to find duplicates and aggregate X values into tuples:

df = data.groupby(['A', 'B', 'C']).filter(lambda group: len(group) > 1).groupby(['A', 'B', 'C']).aggregate(tuple)

This is basically what I need, but I struggle with transforming it further.

I don't know how many duplicates for a given key I have in my data, so I need to find some max value and compute columns:

df['items'] = df['X'].apply(lambda x: len(x))
columns = [f'x_{i}' for i in range(1, df['X'].max() + 1)]

and then create new dataframe with new columns:

df2 = pd.DataFrame(df['RATE'].tolist(), columns=columns)

But at this point I lost index :shrug:

This page on Pandas docs suggests I should use something like this:

df.pivot(columns=columns, values=['X'])

because df already contains an index, but I get this (confusing) error:

KeyError: "None of [Index(['x_1', 'x_2'], dtype='object')] are in the [columns]"

What am I missing here?

Upvotes: 1

Views: 217

Answers (1)

Quang Hoang
Quang Hoang

Reputation: 150825

I originally marked this as a duplicate of the infamous, but since this is a bit different, here's an answer:

(df.assign(col=df.groupby(['A','B','C']).cumcount().add(1))
  .pivot_table(index=['A','B','C'], columns='col', values='X')
  .add_prefix('X')
  .reset_index()
)

Output:

col  A  B  C   X1   X2   X3
0    a  a  a  1.0  2.1  NaN
1    a  b  b  1.2  2.4  3.6
2    b  c  c  1.1  NaN  NaN
3    b  c  d  1.0  NaN  NaN

Note: this only differs to the linked question/answer in that you groupby/pivot on a set of columns, instead of one column.

Upvotes: 1

Related Questions