Reputation: 2500
I have pandas df with say, 100 rows, 10 columns, (actual data is huge). I also have row_index list which contains, which rows to be considered to take mean. I want to calculate mean on say columns 2,5,6,7 and 8. Can we do it with some function for dataframe object?
What I know is do a for loop, get value of row for each element in row_index and keep doing mean. Do we have some direct function where we can pass row_list, and column_list and axis, for ex df.meanAdvance(row_list,column_list,axis=0)
?
I have seen DataFrame.mean() but it didn't help I guess.
a b c d q
0 1 2 3 0 5
1 1 2 3 4 5
2 1 1 1 6 1
3 1 0 0 0 0
I want mean of 0, 2, 3
rows for each a, b, d
columns
a b d
0 1 1 2
Upvotes: 25
Views: 134867
Reputation: 15545
You can select specific columns from a DataFrame by passing a list of indices to .iloc
, for example:
df.iloc[:, [2,5,6,7,8]]
Will return a DataFrame containing those numbered columns (note: This uses 0-based indexing, so 2
refers to the 3rd column.)
To take a mean down of that column, you could use:
# Mean along 0 (vertical) axis: return mean for specified columns, calculated across all rows
df.iloc[:, [2,5,6,7,8]].mean(axis=0)
To take a mean across that column, you could use:
# Mean along 1 (horizontal) axis: return mean for each row, calculated across specified columns
df.iloc[:, [2,5,6,7,8]].mean(axis=1)
You can also supply specific indices for both axes to return a subset of the table:
df.iloc[[1,2,3,4], [2,5,6,7,8]]
For your specific example, you would do:
import pandas as pd
import numpy as np
df = pd.DataFrame(
np.array([[1,2,3,0,5],[1,2,3,4,5],[1,1,1,6,1],[1,0,0,0,0]]),
columns=["a","b","c","d","q"],
index = [0,1,2,3]
)
#I want mean of 0, 2, 3 rows for each a, b, d columns
#. a b d
#0 1 1 2
df.iloc[ [0,2,3], [0,1,3] ].mean(axis=0)
Which outputs:
a 1.0
b 1.0
d 2.0
dtype: float64
Alternatively, to access via column names, first select on those:
df[ ['a','b','d'] ].iloc[ [0,1,3] ].mean(axis=0)
To answer the second part of your question (from the comments) you can join multiple DataFrames together using pd.concat
. It is faster to accumulate the frames in a list and then pass to pd.concat
in one go, e.g.
dfs = []
for ix in idxs:
dfm = df.iloc[ [0,2,3], ix ].mean(axis=0)
dfs.append(dfm)
dfm_summary = pd.concat(dfs, axis=1) # Stack horizontally
Upvotes: 8
Reputation: 3677
To select the rows of your dataframe you can use iloc, you can then select the columns you want using square brackets.
For example:
df = pd.DataFrame(data=[[1,2,3]]*5, index=range(3, 8), columns = ['a','b','c'])
gives the following dataframe:
a b c
3 1 2 3
4 1 2 3
5 1 2 3
6 1 2 3
7 1 2 3
to select only the 3d and fifth row you can do:
df.iloc[[2,4]]
which returns:
a b c
5 1 2 3
7 1 2 3
if you then want to select only columns b and c you use the following command:
df[['b', 'c']].iloc[[2,4]]
which yields:
b c
5 2 3
7 2 3
To then get the mean of this subset of your dataframe you can use the df.mean function. If you want the means of the columns you can specify axis=0, if you want the means of the rows you can specify axis=1
thus:
df[['b', 'c']].iloc[[2,4]].mean(axis=0)
returns:
b 2
c 3
As we should expect from the input dataframe.
For your code you can then do:
df[column_list].iloc[row_index_list].mean(axis=0)
EDIT after comment: New question in comment: I have to store these means in another df/matrix. I have L1, L2, L3, L4...LX lists which tells me the index whose mean I need for columns C[1, 2, 3]. For ex: L1 = [0, 2, 3] , means I need mean of rows 0,2,3 and store it in 1st row of a new df/matrix. Then L2 = [1,4] for which again I will calculate mean and store it in 2nd row of the new df/matrix. Similarly till LX, I want the new df to have X rows and len(C) columns. Columns for L1..LX will remain same. Could you help me with this?
Answer:
If i understand correctly, the following code should do the trick (Same df as above, as columns I took 'a' and 'b':
first you loop over all the lists of rows, collection all the means as pd.series, then you concatenate the resulting list of series over axis=1, followed by taking the transpose to get it in the right format.
dfs = list()
for l in L:
dfs.append(df[['a', 'b']].iloc[l].mean(axis=0))
mean_matrix = pd.concat(dfs, axis=1).T
Upvotes: 23