Reputation: 3881
I have a large pandas.DataFrame
that looks something like this:
test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = range(3) + range(3) + range(4)
id score name 0 -0.652909 A 1 0.100885 A 2 0.410907 A 0 0.304012 B 1 -0.198157 B 2 -0.054764 B 0 0.358484 C 1 0.616415 C 2 0.389018 C 3 1.164172 C
So the index is non-unique but is unique if I group by the column name
. I would like to split the data frame into subsections by name and then assemble (by means of an outer join) the score columns into one big new data frame and change the column names of the scores to the respective group key. What I have at the moment is:
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
df = df.join(sub["score"], how="outer")
df.columns.values[-1] = key
this yields the expected result:
id A B C 0 -0.652909 0.304012 0.358484 1 0.100885 -0.198157 0.616415 2 0.410907 -0.054764 0.389018 3 NaN NaN 1.164172
but seems not very pandas
-ic. Is there a better way?
Edit: Based on the answers I ran some simple timings.
%%timeit
df = pandas.DataFrame()
for (key, sub) in test.groupby("name"):
df = df.join(sub["score"], how="outer")
df.columns.values[-1] = key
100 loops, best of 3: 2.46 ms per loop
%%timeit
test.set_index([test.index, "name"]).unstack()
1000 loops, best of 3: 1.04 ms per loop
%%timeit
test.pivot_table("score", test.index, "name")
100 loops, best of 3: 2.54 ms per loop
So unstack
seems the method of choice.
Upvotes: 2
Views: 1490
Reputation: 4051
I recently came across a similar problem, which was solved by using a pivot_table
a = """id score name
0 -0.652909 A
1 0.100885 A
2 0.410907 A
0 0.304012 B
1 -0.198157 B
2 -0.054764 B
0 0.358484 C
1 0.616415 C
2 0.389018 C
3 1.164172 C"""
df = pd.read_csv(StringIO.StringIO(a),sep="\s*")
df = df.pivot_table('score','id','name')
print df
Output:
name A B C
id
0 -0.652909 0.304012 0.358484
1 0.100885 -0.198157 0.616415
2 0.410907 -0.054764 0.389018
3 NaN NaN 1.164172
Upvotes: 1
Reputation: 16508
The function you look for is unstack. In order for pandas
to know, what to unstack for, we will first create a MultiIndex
where we add the column as last index. unstack()
will then unstack (by default) based on the last index layer, so we get exactly what you want:
In[152]: test = pandas.DataFrame({"score": numpy.random.randn(10)})
test["name"] = ["A"] * 3 + ["B"] * 3 + ["C"] * 4
test.index = range(3) + range(3) + range(4)
In[153]: test
Out[153]:
score name
0 -0.208392 A
1 -0.103659 A
2 1.645287 A
0 0.119709 B
1 -0.047639 B
2 -0.479155 B
0 -0.415372 C
1 -1.390416 C
2 -0.384158 C
3 -1.328278 C
In[154]: test.set_index([index, 'name'], inplace=True)
test.unstack()
Out[154]:
score
name A B C
0 -0.208392 0.119709 -0.415372
1 -0.103659 -0.047639 -1.390416
2 1.645287 -0.479155 -0.384158
3 NaN NaN -1.328278
Upvotes: 2