Split a data frame using groupby and merge the subsets into columns

Question

I have a large pandas.DataFrame that looks something like this:

test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = range(3) + range(3) + range(4)

id  score       name
0   -0.652909   A
1   0.100885    A
2   0.410907    A
0   0.304012    B
1   -0.198157   B
2   -0.054764   B
0   0.358484    C
1   0.616415    C
2   0.389018    C
3   1.164172    C

So the index is non-unique but is unique if I group by the column name. I would like to split the data frame into subsections by name and then assemble (by means of an outer join) the score columns into one big new data frame and change the column names of the scores to the respective group key. What I have at the moment is:

df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key

this yields the expected result:

id  A           B           C
0   -0.652909   0.304012    0.358484
1   0.100885    -0.198157   0.616415
2   0.410907    -0.054764   0.389018
3   NaN         NaN         1.164172

but seems not very pandas-ic. Is there a better way?

Edit: Based on the answers I ran some simple timings.

%%timeit
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
    df = df.join(sub["score"], how="outer")
    df.columns.values[-1] = key

100 loops, best of 3: 2.46 ms per loop

%%timeit
test.set_index([test.index, "name"]).unstack()

1000 loops, best of 3: 1.04 ms per loop

%%timeit
test.pivot_table("score", test.index, "name")

100 loops, best of 3: 2.54 ms per loop

So unstack seems the method of choice.

FooBar · Accepted Answer

The function you look for is unstack. In order for pandas to know, what to unstack for, we will first create a MultiIndex where we add the column as last index. unstack() will then unstack (by default) based on the last index layer, so we get exactly what you want:

In[152]: test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = range(3) + range(3) + range(4)
In[153]: test
Out[153]: 
      score name
0 -0.208392    A
1 -0.103659    A
2  1.645287    A
0  0.119709    B
1 -0.047639    B
2 -0.479155    B
0 -0.415372    C
1 -1.390416    C
2 -0.384158    C
3 -1.328278    C
In[154]: test.set_index([index, 'name'], inplace=True)
test.unstack()
Out[154]: 
         score                    
name         A         B         C
0    -0.208392  0.119709 -0.415372
1    -0.103659 -0.047639 -1.390416
2     1.645287 -0.479155 -0.384158
3          NaN       NaN -1.328278

Split a data frame using groupby and merge the subsets into columns

Answers (2)

Related Questions