Reputation: 1118
I have a pandas.DataFrame with the following structure:
>>> data
a b values
1 0 [1, 2, 3, 4]
2 0 [3, 4, 5, 6]
1 1 [1, 3, 7, 9]
2 1 [2, 4, 6, 8]
(Each entry in 'values' is a numpy.array.) What I want to do is group the data by column 'a' and then combine the lists of values.
My goal is to end up with the following:
>>> data
a values
1 [1, 2, 3, 4, 1, 3, 7, 9]
2 [3, 4, 5, 6, 2, 4, 6, 8]
Note that the order of the values does not matter. How do I achieve this? I thought about something like
>>> grps = data.groupby(['a'])
>>> grps['values'].agg(np.concatenate)
but this fails with a KeyError. I'm sure there is an idiomatic pandas way to achieve this, but how?
Thanks.
Upvotes: 4
Views: 6936
Reputation: 402603
Similar to John Galt's answer, you can group and then apply np.hstack:
In [278]: df.groupby('a')['values'].apply(np.hstack)
Out[278]:
a
1 [1, 2, 3, 4, 1, 3, 7, 9]
2 [3, 4, 5, 6, 2, 4, 6, 8]
Name: values, dtype: object
To get your frame back, you'll need pd.Series.to_frame and pd.DataFrame.reset_index:
In [311]: df.groupby('a')['values'].apply(np.hstack).to_frame().reset_index()
Out[311]:
a values
0 1 [1, 2, 3, 4, 1, 3, 7, 9]
1 2 [3, 4, 5, 6, 2, 4, 6, 8]
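For anyone who wants to reproduce this end to end, here is a minimal sketch. It assumes the frame from the question is rebuilt by hand with NumPy arrays in 'values'; the variable name combined is only illustrative, and reset_index is called directly on the grouped Series as a shortcut for to_frame().reset_index().

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question; each 'values' entry is a NumPy array.
df = pd.DataFrame({
    "a": [1, 2, 1, 2],
    "b": [0, 0, 1, 1],
    "values": [
        np.array([1, 2, 3, 4]),
        np.array([3, 4, 5, 6]),
        np.array([1, 3, 7, 9]),
        np.array([2, 4, 6, 8]),
    ],
})

# Concatenate the arrays within each group of 'a', then move the group
# key out of the index so the result is a two-column frame again.
combined = df.groupby("a")["values"].apply(np.hstack).reset_index()
print(combined)
```

Column 'b' is simply dropped here because only 'values' is selected after the groupby.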
Performance
df_test = pd.concat([df] * 10000) # setup
%timeit df_test.groupby('a')['values'].apply(np.hstack) # mine
1 loop, best of 3: 219 ms per loop
%timeit df_test.groupby('a')['values'].sum() # John's
1 loop, best of 3: 4.44 s per loop
sum is very inefficient for lists, and does not work when 'values' holds np.array objects, because + on arrays adds elementwise instead of concatenating.
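The difference between the two approaches can be seen outside pandas. The sketch below uses the built-in sum on plain Python objects (not the groupby machinery itself) as a stand-in, so treat it as an illustration of the underlying + semantics rather than of what pandas does internally; the names lists, arrays, joined, added and stacked are made up for the example.

```python
import numpy as np

lists = [[1, 2, 3, 4], [1, 3, 7, 9]]
arrays = [np.array(x) for x in lists]

# list + list concatenates, but each + builds a new list, so summing
# many lists repeatedly copies everything accumulated so far.
joined = sum(lists, [])

# array + array adds elementwise, so the lengths stay the same and
# the values are summed rather than joined.
added = sum(arrays)

# np.hstack joins the arrays end to end in a single call.
stacked = np.hstack(arrays)

print(joined)
print(added)
print(stacked)
```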
Upvotes: 3
Reputation: 76927
You can use sum to join lists.
In [640]: data.groupby('a')['values'].sum()
Out[640]:
a
1 [1, 2, 3, 4, 1, 3, 7, 9]
2 [3, 4, 5, 6, 2, 4, 6, 8]
Name: values, dtype: object
Or,
In [653]: data.groupby('a', as_index=False).agg({'values': 'sum'})
Out[653]:
a values
0 1 [1, 2, 3, 4, 1, 3, 7, 9]
1 2 [3, 4, 5, 6, 2, 4, 6, 8]
Upvotes: 1