Jack Florey

Reputation: 87

Pandas: "distribute" column values into multiple rows

I am a pandas newbie, and I am trying to solve the following problem.

I have a large DataFrame (10000 x 28) as follows.

Col1 Col2 Col3 Col4 Col5
  A    B   C    D    E

How can I reshape it to look like the following?

Col1 Col2 NewColName
  A    B    C
  A    B    D
  A    B    E

The first two columns remain the same, and the last three columns get "distributed" over them.

My goal is to take a large DataFrame and turn it into "tidy data".

I have tried methods like concat, stack etc. but I feel I'm missing some elegant way to solve this.

Upvotes: 2

Views: 924

Answers (2)

Stefan

Reputation: 42905

Set ['Col1', 'Col2'] as the index and use .stack().

df.set_index(['Col1', 'Col2']).stack()

Col1  Col2      
A     B     Col3    C
            Col4    D
            Col5    E
dtype: object

Then drop the innermost index level and call .reset_index() to format the result as in your example (passing name='Col', as suggested by @jezrael, names the value column):

df.set_index(['Col1', 'Col2']).stack().reset_index(-1, drop=True).reset_index(name='Col')

  Col1 Col2 Col
0    A    B   C
1    A    B   D
2    A    B   E
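
For anyone who wants to run this end to end, here is a minimal sketch of the stack approach. The one-row sample frame is an assumption built from the example in the question, and the names stacked and tidy are just illustrative.

```python
import pandas as pd

# Sample frame mirroring the question's single-row example (assumed data).
df = pd.DataFrame(
    [["A", "B", "C", "D", "E"]],
    columns=["Col1", "Col2", "Col3", "Col4", "Col5"],
)

# Move the id columns into the index, then pivot the remaining columns
# into an inner index level, producing one long Series.
stacked = df.set_index(["Col1", "Col2"]).stack()

# Drop the inner level (the old column names) and turn the remaining
# index levels back into ordinary columns; the values go into "Col".
tidy = stacked.reset_index(level=-1, drop=True).reset_index(name="Col")
print(tidy)
```

If the source column of each value matters, leaving out the drop step and calling reset_index directly keeps that level as an extra column.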

Upvotes: 0

jezrael

Reputation: 863166

You can use melt and drop:

print(pd.melt(df, id_vars=['Col1','Col2'], value_name='Col').drop('variable', axis=1))
  Col1 Col2 Col
0    A    B   C
1    A    B   D
2    A    B   E
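
One thing to be aware of with more than one input row: melt emits the values column by column, not row by row. The sketch below uses an assumed two-row frame to show this, and one possible way to regroup by source row. It relies on the ignore_index parameter of melt, which I believe needs pandas 1.1 or later; the variable names are illustrative.

```python
import pandas as pd

# Two-row frame (assumed data) so the ordering of the result is visible.
df = pd.DataFrame(
    [["A", "B", "C", "D", "E"],
     ["F", "G", "H", "I", "J"]],
    columns=["Col1", "Col2", "Col3", "Col4", "Col5"],
)

# melt walks the value columns one at a time, so all Col3 values come
# first, then all Col4 values, then all Col5 values.
melted = pd.melt(df, id_vars=["Col1", "Col2"], value_name="Col").drop("variable", axis=1)
print(melted["Col"].tolist())

# Keeping the original row index and sorting on it with a stable sort
# groups the values by the row they came from.
by_row = (
    pd.melt(df, id_vars=["Col1", "Col2"], value_name="Col", ignore_index=False)
    .drop("variable", axis=1)
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(by_row["Col"].tolist())
```

The stack approach already produces the row-grouped order, so this only matters if row order is important and you prefer melt.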

Timings:

df = pd.concat([df]*1000).reset_index(drop=True)

In [58]: %timeit pd.melt(df, id_vars=['Col1','Col2'],value_name='Col').drop('variable', axis=1)
100 loops, best of 3: 2.48 ms per loop

In [59]: %timeit df.set_index(['Col1', 'Col2']).stack().reset_index(-1, drop=True).reset_index(name='Col')
100 loops, best of 3: 3.83 ms per loop
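
To repeat the comparison outside IPython, something like the following sketch with the standard timeit module should work. The sample frame and the helper names with_melt and with_stack are assumptions for illustration, and the numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# One-row sample frame from the question (assumed data), repeated 1000 times.
df = pd.DataFrame(
    [["A", "B", "C", "D", "E"]],
    columns=["Col1", "Col2", "Col3", "Col4", "Col5"],
)
df = pd.concat([df] * 1000).reset_index(drop=True)


def with_melt():
    return pd.melt(df, id_vars=["Col1", "Col2"], value_name="Col").drop("variable", axis=1)


def with_stack():
    return (
        df.set_index(["Col1", "Col2"])
        .stack()
        .reset_index(level=-1, drop=True)
        .reset_index(name="Col")
    )


# Report the best of 3 runs of 100 loops each, similar to %timeit.
for fn in (with_melt, with_stack):
    best = min(timeit.repeat(fn, number=100, repeat=3)) / 100
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per loop")
```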

Upvotes: 0
