Reputation: 504
I have a very large data file (tens of thousands of rows and columns) formatted similarly to this.
name x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1
gene1 x y 2 3 2 1
gene2 x y 5 7 6 2
My goal for each gene is to find the mean of each set of repetitions.
At the end I would like to only have columns of mean values titled something like "00hr_bio" and delete all the individual repetitions.
My thinking right now is to use something like this:
for row in df:
df[avg] = df.iloc[3:].rolling(window=3, axis=1).mean()
But I have no idea how to actually make this work.
The df.iloc[3]
is my way of trying to start from the 3rd column but I am fairly certain doing it this way does not work.
I don't even know where to begin in terms of "merging" the 3 columns into only 1.
Any suggestions you have will be greatly appreciated as I obviously have no idea what I am doing.
Upvotes: 1
Views: 92
Reputation: 26189
You're pretty close.
df['avg'] = df.iloc[:, 2:].mean(axis=1)
will get you this:
x y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1 avg
gene1 x y 2 3 2 1 2.0
gene2 x y 5 7 6 2 5.0
If you wish to get the mean from different sets of columns, you could do something like this:
for col in range(10):
df['avg%i' % col] = df.iloc[:, 2+col*5:7+col*5].mean(axis=1)
If you have the same number of columns per average. Otherwise you'd probably want to use the name of the rep columns, depending on what your data looks like.
Upvotes: 0
Reputation: 149185
I would first build a Series of final names indexed by the original columns:
names = pd.Series(['_'.join(i.split('_')[:-1]) for i in df.columns[3:]],
index = df.columns[3:])
I would then use it to ask a mean of a groupby on axis 1:
tmp = df.iloc[:, 3:].groupby(names, axis=1).agg('mean')
It gives a new dataframe indexed like the original one and having the averaged columns:
gh_00hr_bio gh_06hr_bio
0 2.333333 1.0
1 6.000000 2.0
You can then horizontally concat it to the first dataframe or to its 3 first columns:
result = pd.concat([df.iloc[:, :3], tmp], axis=1)
to get:
name x y gh_00hr_bio gh_06hr_bio
0 gene1 x y 2.333333 1.0
1 gene2 x y 6.000000 2.0
Upvotes: 1