keenan
keenan

Reputation: 504

Finding the mean of consecutive columns

I have a very large data file (tens of thousands of rows and columns) formatted similarly to this.

name   x  y gh_00hr_bio_rep1 gh_00hr_bio_rep2 gh_00hr_bio_rep3 gh_06hr_bio_rep1
gene1  x  y         2               3               2               1
gene2  x  y         5               7               6               2

My goal for each gene is to find the mean of each set of repetitions.

At the end I would like to only have columns of mean values titled something like "00hr_bio" and delete all the individual repetitions.

My thinking right now is to use something like this:

for row in df:
    df[avg] = df.iloc[3:].rolling(window=3, axis=1).mean() 

But I have no idea how to actually make this work.

The df.iloc[3] is my way of trying to start from the 3rd column but I am fairly certain doing it this way does not work.

I don't even know where to begin in terms of "merging" the 3 columns into only 1.

Any suggestions you have will be greatly appreciated as I obviously have no idea what I am doing.

Upvotes: 1

Views: 92

Answers (2)

Jonas Byström
Jonas Byström

Reputation: 26189

You're pretty close.

df['avg'] = df.iloc[:, 2:].mean(axis=1)

will get you this:

       x  y  gh_00hr_bio_rep1  gh_00hr_bio_rep2  gh_00hr_bio_rep3  gh_06hr_bio_rep1  avg
gene1  x  y                 2                 3                 2                 1  2.0
gene2  x  y                 5                 7                 6                 2  5.0

If you wish to get the mean from different sets of columns, you could do something like this:

for col in range(10):
    df['avg%i' % col] = df.iloc[:, 2+col*5:7+col*5].mean(axis=1)

If you have the same number of columns per average. Otherwise you'd probably want to use the name of the rep columns, depending on what your data looks like.

Upvotes: 0

Serge Ballesta
Serge Ballesta

Reputation: 149185

I would first build a Series of final names indexed by the original columns:

names = pd.Series(['_'.join(i.split('_')[:-1]) for i in df.columns[3:]],
                  index = df.columns[3:])

I would then use it to ask a mean of a groupby on axis 1:

tmp = df.iloc[:, 3:].groupby(names, axis=1).agg('mean')

It gives a new dataframe indexed like the original one and having the averaged columns:

   gh_00hr_bio  gh_06hr_bio
0     2.333333          1.0
1     6.000000          2.0

You can then horizontally concat it to the first dataframe or to its 3 first columns:

result = pd.concat([df.iloc[:, :3], tmp], axis=1)

to get:

    name  x  y  gh_00hr_bio  gh_06hr_bio
0  gene1  x  y     2.333333          1.0
1  gene2  x  y     6.000000          2.0

Upvotes: 1

Related Questions