Danial Tz

Reputation: 1984

pandas DataFrame groupby like MySQL, but with the result in a new column

import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
df

    A  mask  values
0  11     0      10
1  11     0      15
2  22     0      20
3  22     1      25

Now how can I group by A, keep the column names intact, and put the result of a custom function into a new column Z:

def calculate_df_stats(dfs):
    # Masked mean of 'values': rows where 'mask' is nonzero are ignored
    mask_ = list(dfs['mask'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean

df['Z'] = df.groupby('A').agg(calculate_df_stats) # does not work

and generate:

    A  mask  values     Z
0  11     0      10  12.5
1  22     0      20  20.0

Whatever I do, it only replaces the values column with the masked mean.

Also, can your solution be applied to a function that takes two columns and returns its result in a new column?

Thanks!

Edit: To clarify further, let's say I have a table like this in MySQL:

SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560' 
LIMIT 200;

which gives me this result: http://pastebin.com/qXiaWcJq

If I now run this:

SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560' 
group by `reader_plate_ID`;

I get:

datapoint_ID    plate_ID    coordinate_x    coordinate_y    res_value   wavelength  ignore  avg(action_value)
193 1   0   0   2.1783  560 NULL    2.090027083333334
481 2   0   0   1.7544  560 NULL    1.4695583333333333
769 3   0   0   2.0161  560 NULL    1.6637885416666673

How can I replicate this behaviour in pandas? Note that all the column names stay the same, the first value of each group is taken, and the new column is added.

Upvotes: 0

Views: 1106

Answers (1)

joris

Reputation: 139242

If you want the original columns in your result, you can first compute the grouped and aggregated dataframe. You will have to aggregate the original columns in some way; I took the first occurring value of each group as an example:

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>> 
>>> grouped = df.groupby("A")
>>> 
>>> result = grouped.agg('first')
>>> result
    mask  values
A               
11     0      10
22     0      20

Then add a column 'Z' to that result by applying your function to the groupby object 'grouped':

>>> def calculate_df_stats(dfs):
...     mask_ = list(dfs['mask'])
...     mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
...     return mean
... 
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>> 
>>> result
    mask  values     Z
A                     
11     0      10  12.5
22     0      20  20.0
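
The result above has 'A' as its index, whereas the MySQL output in the question keeps the grouping key as an ordinary column. A minimal sketch of one way to get that shape, assuming the same example data; selecting the two needed columns before `apply` is my own choice here, not part of the original session:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 11, 22, 22],
                   'mask': [0, 0, 0, 1],
                   'values': np.arange(10, 30, 5)})

def calculate_df_stats(dfs):
    # Mean of 'values', skipping rows where 'mask' is nonzero.
    return np.ma.array(dfs['values'].to_numpy(),
                       mask=dfs['mask'].to_numpy()).mean()

grouped = df.groupby('A')

# First row of every group, like MySQL's "SELECT * ... GROUP BY".
result = grouped.agg('first')

# One value per group; assignment aligns on the 'A' index.
result['Z'] = grouped[['mask', 'values']].apply(calculate_df_stats)

# Turn the 'A' index back into a regular column.
result = result.reset_index()
print(result)
```

After `reset_index()`, the columns should be A, mask, values and Z, with one row per group.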

Your function receives each group as a dataframe, so it can use as many columns as it needs (just refer to them by name) to compute the value it returns.
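
As an illustration of a function working on two columns, here is a small sketch that computes a weighted mean per group. The 'weight' column and the `weighted_mean` helper are hypothetical and only serve the example; they are not part of the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [11, 11, 22, 22],
                   'values': [10, 15, 20, 25],
                   'weight': [1, 3, 1, 1]})  # hypothetical second column

def weighted_mean(g):
    # Uses two columns of the group: 'values' weighted by 'weight'.
    return np.average(g['values'], weights=g['weight'])

grouped = df.groupby('A')

# Keep the first row of each group, then add the per-group statistic.
result = grouped.first()
result['Z'] = grouped[['values', 'weight']].apply(weighted_mean)
print(result)
```

The pattern is the same as with the masked mean: aggregate the original columns once, then attach the custom per-group value as a new column.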

Upvotes: 2
