impossible
impossible

Reputation: 2500

Calculate mean for selected rows for selected columns in pandas data frame

I have pandas df with say, 100 rows, 10 columns, (actual data is huge). I also have row_index list which contains, which rows to be considered to take mean. I want to calculate mean on say columns 2,5,6,7 and 8. Can we do it with some function for dataframe object?

What I know is do a for loop, get value of row for each element in row_index and keep doing mean. Do we have some direct function where we can pass row_list, and column_list and axis, for ex df.meanAdvance(row_list,column_list,axis=0) ?

I have seen DataFrame.mean() but it didn't help I guess.

  a b c d q 
0 1 2 3 0 5
1 1 2 3 4 5
2 1 1 1 6 1
3 1 0 0 0 0

I want mean of 0, 2, 3 rows for each a, b, d columns

  a b d
0 1 1 2

Upvotes: 25

Views: 134867

Answers (2)

mfitzp
mfitzp

Reputation: 15545

You can select specific columns from a DataFrame by passing a list of indices to .iloc, for example:

df.iloc[:, [2,5,6,7,8]]

Will return a DataFrame containing those numbered columns (note: This uses 0-based indexing, so 2 refers to the 3rd column.)

To take a mean down of that column, you could use:

# Mean along 0 (vertical) axis: return mean for specified columns, calculated across all rows
df.iloc[:, [2,5,6,7,8]].mean(axis=0) 

To take a mean across that column, you could use:

# Mean along 1 (horizontal) axis: return mean for each row, calculated across specified columns
df.iloc[:, [2,5,6,7,8]].mean(axis=1)

You can also supply specific indices for both axes to return a subset of the table:

df.iloc[[1,2,3,4], [2,5,6,7,8]]

For your specific example, you would do:

import pandas as pd
import numpy as np

df = pd.DataFrame( 
np.array([[1,2,3,0,5],[1,2,3,4,5],[1,1,1,6,1],[1,0,0,0,0]]),
columns=["a","b","c","d","q"],
index = [0,1,2,3]
)

#I want mean of 0, 2, 3 rows for each a, b, d columns
#. a b d
#0 1 1 2

df.iloc[ [0,2,3], [0,1,3] ].mean(axis=0)

Which outputs:

a    1.0
b    1.0
d    2.0
dtype: float64

Alternatively, to access via column names, first select on those:

df[ ['a','b','d'] ].iloc[ [0,1,3] ].mean(axis=0)

To answer the second part of your question (from the comments) you can join multiple DataFrames together using pd.concat. It is faster to accumulate the frames in a list and then pass to pd.concat in one go, e.g.

dfs = []
for ix in idxs:
    dfm = df.iloc[ [0,2,3], ix ].mean(axis=0)
    dfs.append(dfm)

dfm_summary = pd.concat(dfs, axis=1) # Stack horizontally

Upvotes: 8

PdevG
PdevG

Reputation: 3677

To select the rows of your dataframe you can use iloc, you can then select the columns you want using square brackets.

For example:

 df = pd.DataFrame(data=[[1,2,3]]*5, index=range(3, 8), columns = ['a','b','c'])

gives the following dataframe:

   a  b  c
3  1  2  3
4  1  2  3
5  1  2  3
6  1  2  3
7  1  2  3

to select only the 3d and fifth row you can do:

df.iloc[[2,4]]

which returns:

   a  b  c
5  1  2  3
7  1  2  3

if you then want to select only columns b and c you use the following command:

df[['b', 'c']].iloc[[2,4]]

which yields:

   b  c
5  2  3
7  2  3

To then get the mean of this subset of your dataframe you can use the df.mean function. If you want the means of the columns you can specify axis=0, if you want the means of the rows you can specify axis=1

thus:

df[['b', 'c']].iloc[[2,4]].mean(axis=0)

returns:

b    2
c    3

As we should expect from the input dataframe.

For your code you can then do:

 df[column_list].iloc[row_index_list].mean(axis=0)

EDIT after comment: New question in comment: I have to store these means in another df/matrix. I have L1, L2, L3, L4...LX lists which tells me the index whose mean I need for columns C[1, 2, 3]. For ex: L1 = [0, 2, 3] , means I need mean of rows 0,2,3 and store it in 1st row of a new df/matrix. Then L2 = [1,4] for which again I will calculate mean and store it in 2nd row of the new df/matrix. Similarly till LX, I want the new df to have X rows and len(C) columns. Columns for L1..LX will remain same. Could you help me with this?

Answer:

If i understand correctly, the following code should do the trick (Same df as above, as columns I took 'a' and 'b':

first you loop over all the lists of rows, collection all the means as pd.series, then you concatenate the resulting list of series over axis=1, followed by taking the transpose to get it in the right format.

dfs = list()
for l in L:
    dfs.append(df[['a', 'b']].iloc[l].mean(axis=0))

mean_matrix = pd.concat(dfs, axis=1).T

Upvotes: 23

Related Questions