Reputation: 5
I have a python dataframe and some columns refer to repeated samples as below:
In [3]: df = pd.DataFrame(
...: [[89, 89, 12, 34, 32],
...: [788, 25, 55, 65, 55],
...: [588, 23, 58, 8, 55],
...: [25, 14, 45, 123, 58]],
...: columns = ['sample1','sample2.1','sample2.2','sample3','sample4'],
...: )
In [4]: df
sample1 sample2.1 sample2.2 sample3 sample4
0 89 89 12 34 32
1 788 25 55 65 55
2 588 23 58 8 55
3 25 14 45 123 58
for the repeated samples, sample2.1
and sample2.2
, I want to remain with an average of the two, i.e
sample1 sample2_averaged sample3 sample4
0 89 50.5 34 32
1 788 40.0 65 55
2 588 40.5 8 55
3 25 29.5 123 58
I am thinking of using regex but I have never used them on python dataframes
Upvotes: 0
Views: 47
Reputation: 13397
Try:
import re
from itertools import groupby
res=pd.DataFrame(index=df.index, columns=[])
for k,v in groupby(df.columns, key=lambda el: re.sub(r"\.[^\.]+$", "", el)):
v=list(v)
if(len(v)==1):
res[k]=df[v[0]]
else:
res[k]=df[v].mean(axis=1)
Outputs:
>>> res
sample1 sample2 sample3 sample4
0 89 50.5 34 32
1 788 40.0 65 55
2 588 40.5 8 55
3 25 29.5 123 58
Upvotes: 0
Reputation: 15452
You can group by columns if you provide axis=1
, e.g.:
>>> df.groupby(df.columns.str.replace(r'\..+', ''), axis=1).mean()
sample1 sample2 sample3 sample4
0 89.0 50.5 34.0 32.0
1 788.0 40.0 65.0 55.0
2 588.0 40.5 8.0 55.0
3 25.0 29.5 123.0 58.0
Pandas columns and indices can use the pandas.Series.str
string accessor methods, including regex.
Upvotes: 3
Reputation: 150785
I would do:
(df.T.groupby(df.columns.str.extract('^([^\.]+)')[0].values)
.mean().T
)
Output:
sample1 sample2 sample3 sample4
0 89.0 50.5 34.0 32.0
1 788.0 40.0 65.0 55.0
2 588.0 40.5 8.0 55.0
3 25.0 29.5 123.0 58.0
Upvotes: 0