R Odongo
R Odongo

Reputation: 5

Find the average of similar dataframe columns

I have a python dataframe and some columns refer to repeated samples as below:

In [3]: df = pd.DataFrame(
   ...:     [[89, 89, 12, 34, 32],
   ...:      [788, 25, 55, 65, 55],
   ...:      [588, 23, 58, 8, 55],
   ...:      [25, 14, 45, 123, 58]],
   ...:     columns = ['sample1','sample2.1','sample2.2','sample3','sample4'],
   ...: )
In [4]: df
   sample1  sample2.1  sample2.2  sample3  sample4
0       89         89         12       34       32
1      788         25         55       65       55
2      588         23         58        8       55
3       25         14         45      123       58

for the repeated samples, sample2.1 and sample2.2, I want to remain with an average of the two, i.e

   sample1  sample2_averaged  sample3  sample4
0       89              50.5       34       32
1      788              40.0       65       55
2      588              40.5        8       55
3       25              29.5      123       58

I am thinking of using regex but I have never used them on python dataframes

Upvotes: 0

Views: 47

Answers (3)

Georgina Skibinski
Georgina Skibinski

Reputation: 13397

Try:

import re
from itertools import groupby

res=pd.DataFrame(index=df.index, columns=[])
for k,v in groupby(df.columns, key=lambda el: re.sub(r"\.[^\.]+$", "", el)):
    v=list(v)
    if(len(v)==1):
        res[k]=df[v[0]]
    else:
        res[k]=df[v].mean(axis=1)

Outputs:

>>> res

   sample1  sample2  sample3  sample4
0       89     50.5       34       32
1      788     40.0       65       55
2      588     40.5        8       55
3       25     29.5      123       58

Upvotes: 0

Michael Delgado
Michael Delgado

Reputation: 15452

You can group by columns if you provide axis=1, e.g.:

>>> df.groupby(df.columns.str.replace(r'\..+', ''), axis=1).mean()
    sample1 sample2 sample3 sample4
0   89.0    50.5    34.0    32.0
1   788.0   40.0    65.0    55.0
2   588.0   40.5    8.0 55.0
3   25.0    29.5    123.0   58.0

Pandas columns and indices can use the pandas.Series.str string accessor methods, including regex.

Upvotes: 3

Quang Hoang
Quang Hoang

Reputation: 150785

I would do:

(df.T.groupby(df.columns.str.extract('^([^\.]+)')[0].values)
   .mean().T
)

Output:

    sample1  sample2  sample3  sample4
0     89.0     50.5     34.0     32.0
1    788.0     40.0     65.0     55.0
2    588.0     40.5      8.0     55.0
3     25.0     29.5    123.0     58.0

Upvotes: 0

Related Questions