Groupby and reduce pandas dataframes with numpy arrays as entries

Question

I have a pandas.DataFrame with the following structure:

>>> data 
a    b    values
1    0    [1, 2, 3, 4]
2    0    [3, 4, 5, 6]
1    1    [1, 3, 7, 9]
2    1    [2, 4, 6, 8]

(Each entry in 'values' is a numpy.array.) What I want to do is group the data by column 'a' and then concatenate the arrays within each group. My goal is to end up with the following:

>>> data 
a    values
1    [1, 2, 3, 4, 1, 3, 7, 9]
2    [3, 4, 5, 6, 2, 4, 6, 8]

Note that the order of the values does not matter. How do I achieve this? I thought about something like

>>> grps = data.groupby(['a'])
>>> grps['values'].agg(np.concatenate)

but this fails with a KeyError. I'm sure there is a pandaic way to achieve this - but how? Thanks.

cs95 · Accepted Answer

Similar to John Galt's answer, you can group and then apply np.hstack:

In [278]: df.groupby('a')['values'].apply(np.hstack)
Out[278]: 
a
1    [1, 2, 3, 4, 1, 3, 7, 9]
2    [3, 4, 5, 6, 2, 4, 6, 8]
Name: values, dtype: object

To get back your frame, you'll need pd.Series.to_frame and pd.DataFrame.reset_index:

In [311]: df.groupby('a')['values'].apply(np.hstack).to_frame().reset_index()
Out[311]: 
   a                    values
0  1  [1, 2, 3, 4, 1, 3, 7, 9]
1  2  [3, 4, 5, 6, 2, 4, 6, 8]
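
For anyone who wants to reproduce this from scratch, here is a minimal self-contained sketch. It assumes the frame is called df and is built by hand from the rows shown in the question; the variable names grouped and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question (assumed construction; each
# entry of 'values' is a 1-D numpy array).
df = pd.DataFrame({
    "a": [1, 2, 1, 2],
    "b": [0, 0, 1, 1],
    "values": [
        np.array([1, 2, 3, 4]),
        np.array([3, 4, 5, 6]),
        np.array([1, 3, 7, 9]),
        np.array([2, 4, 6, 8]),
    ],
})

# Group by 'a' and join the arrays of each group into one flat array.
grouped = df.groupby("a")["values"].apply(np.hstack)

# Turn the resulting Series back into a two-column frame.
result = grouped.to_frame().reset_index()
print(result)
```

Since every array here is one-dimensional, np.hstack just joins them end to end, in the order the rows appear within each group.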

Performance

df_test = pd.concat([df] * 10000) # setup

%timeit df_test.groupby('a')['values'].apply(np.hstack) # mine
1 loop, best of 3: 219 ms per loop

%timeit df_test.groupby('a')['values'].sum() # John's 
1 loop, best of 3: 4.44 s per loop

sum is very inefficient for lists, and it does not work when the entries of 'values' are np.array objects.
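
A rough illustration of the difference, in plain Python and NumPy only. It assumes that summing ultimately relies on the + operator, which is my reading and not something checked against the pandas internals; the names lists, arrays, list_total, array_total and stacked are made up for the example.

```python
import numpy as np

lists = [[1, 2, 3, 4], [1, 3, 7, 9]]
arrays = [np.array(x) for x in lists]

# On lists, + concatenates, but every step copies both operands
# into a brand new list, which adds up over many rows.
list_total = lists[0] + lists[1]

# On arrays, + adds element by element instead of concatenating.
array_total = arrays[0] + arrays[1]

# np.hstack joins the 1-D arrays end to end in a single call.
stacked = np.hstack(arrays)

print(list_total)   # [1, 2, 3, 4, 1, 3, 7, 9]
print(array_total)  # [ 2  5 10 13]
print(stacked)      # [1 2 3 4 1 3 7 9]
```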
