Reputation: 589
I have csv file like this:
A,B,C,X
a,a,a,1.0
a,a,a,2.1
a,b,b,1.2
a,b,b,2.4
a,b,b,3.6
b,c,c,1.1
b,c,d,1.0
(A, B, C)
is a "primary key" in this dataset, that means this set of columns should be unique. What I need to do is to find duplicates and present associated values (X
column) in separate columns, like this:
A,B,C,X1,X2,X3
a,a,a,1.0,2.1,
a,b,b,1.2,2.4,3.6
I somehow know how to find duplicates and aggregate X
values into tuples:
df = data.groupby(['A', 'B', 'C']).filter(lambda group: len(group) > 1).groupby(['A', 'B', 'C']).aggregate(tuple)
This is basically what I need, but I struggle with transforming it further.
I don't know how many duplicates for a given key I have in my data, so I need to find some max value and compute columns:
df['items'] = df['X'].apply(lambda x: len(x))
columns = [f'x_{i}' for i in range(1, df['X'].max() + 1)]
and then create new dataframe with new columns:
df2 = pd.DataFrame(df['RATE'].tolist(), columns=columns)
But at this point I lost index :shrug:
This page on Pandas docs suggests I should use something like this:
df.pivot(columns=columns, values=['X'])
because df
already contains an index, but I get this (confusing) error:
KeyError: "None of [Index(['x_1', 'x_2'], dtype='object')] are in the [columns]"
What am I missing here?
Upvotes: 1
Views: 217
Reputation: 150825
I originally marked this as a duplicate of the infamous, but since this is a bit different, here's an answer:
(df.assign(col=df.groupby(['A','B','C']).cumcount().add(1))
.pivot_table(index=['A','B','C'], columns='col', values='X')
.add_prefix('X')
.reset_index()
)
Output:
col A B C X1 X2 X3
0 a a a 1.0 2.1 NaN
1 a b b 1.2 2.4 3.6
2 b c c 1.1 NaN NaN
3 b c d 1.0 NaN NaN
Note: this only differs to the linked question/answer in that you groupby/pivot on a set of columns, instead of one column.
Upvotes: 1