yanadm

Reputation: 707

Calculate the average of the previous n elements in a column in pandas

I have the following dataframe:

             df1
index   year   week   a     b     c
 -10    2017    10   45    26    19
  -9    2017    11   37    23    14
  -8    2017    12   21    66    19
  -7    2017    13   47    36    92
  -6    2017    14   82    65    18
  -5    2017    15   68    68    19
  -4    2017    16   30    95    24
  -3    2017    17   21    15    94
  -2    2017    18   67    30    16
  -1    2017    19   10    13    13
   0    2017    20   26    22    18
   1    2017    21   NaN   NaN   NaN
   2    2017    22   NaN   NaN   NaN
   3    2017    23   NaN   NaN   NaN
   4    2017    24   NaN   NaN   NaN
   ...
   53   2018    20   NaN   NaN   NaN

For each empty cell, I need to calculate the average of the previous n values in that column and write the result into the cell. Here n is the number of rows from index 0 upward in the table (11 in this example). For example, for the first empty cell in column a I must calculate the average over indexes -10 through 0, for the next cell over indexes -9 through 1, and so on. The same applies to columns a, b and c. The calculations always begin at index = 1.

The problem is that the number of columns such as a, b, c can vary. However, I know that these columns will always come after the week column. Is it possible to apply these calculations to an arbitrary number of columns, given that they are always located after the week column?

I searched hard, but could not find anything suitable.

UPD: If this helps, the maximum number of rows from index = 0 and down will be 53.

Upvotes: 4

Views: 1371

Answers (2)

Rayhane Mama

Reputation: 2424

This can be done as follows:

n = 11 # window size in your example (indexes -10 through 0)
df = df1.loc[range(1,df1.index[-1]+1)] # select the rows from index 1 onward

df should look like:

       year  week   a   b   c
index                        
1      2017    21 NaN NaN NaN
2      2017    22 NaN NaN NaN
3      2017    23 NaN NaN NaN
4      2017    24 NaN NaN NaN

Then fill in the values:

for s in list(df.index): # iterate through rows with nan values
    for i in range(2,df.columns.size): # iterate through different cols ('a','b','c' or more)
        df1.loc[s,df.columns[i]] = df1.loc[range(s-n,s),df.columns[i]].sum()/n
print(df1)

Please note that I followed your example and assumed that year is always the first column and week is always the second, so that every column after week is selected, and that index is the DataFrame index.

output:

       year  week          a          b          c
index                                             
-10    2017    10  45.000000  26.000000  19.000000
-9     2017    11  37.000000  23.000000  14.000000
-8     2017    12  21.000000  66.000000  19.000000
-7     2017    13  47.000000  36.000000  92.000000
-6     2017    14  82.000000  65.000000  18.000000
-5     2017    15  68.000000  68.000000  19.000000
-4     2017    16  30.000000  95.000000  24.000000
-3     2017    17  21.000000  15.000000  94.000000
-2     2017    18  67.000000  30.000000  16.000000
-1     2017    19  10.000000  13.000000  13.000000
 0     2017    20  26.000000  22.000000  18.000000
 1     2017    21  41.272727  41.727273  31.454545
 2     2017    22  40.933884  43.157025  32.586777
 3     2017    23  41.291510  44.989482  34.276484
 4     2017    24  43.136193  43.079434  35.665255
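
For reference, here is a self-contained sketch of the same idea that can be run as-is. It rebuilds the first rows of the example frame and finds the value columns by the position of week instead of hard-coding column numbers. The helper name fill_with_trailing_mean and its marker parameter are made up for this illustration. It uses mean() rather than sum()/n, which gives the same result here because each window is already fully filled by the time it is read.

```python
import numpy as np
import pandas as pd

# Rebuild the first rows of the example frame (indexes -10 through 4).
nan = np.nan
df1 = pd.DataFrame(
    {
        "year": [2017] * 15,
        "week": list(range(10, 25)),
        "a": [45, 37, 21, 47, 82, 68, 30, 21, 67, 10, 26, nan, nan, nan, nan],
        "b": [26, 23, 66, 36, 65, 68, 95, 15, 30, 13, 22, nan, nan, nan, nan],
        "c": [19, 14, 19, 92, 18, 19, 24, 94, 16, 13, 18, nan, nan, nan, nan],
    },
    index=pd.Index(range(-10, 5), name="index"),
)


def fill_with_trailing_mean(frame, n, marker="week"):
    """Fill rows with index >= 1 using the mean of the n preceding rows.

    Hypothetical helper for illustration: every column located after
    `marker` is treated as a value column.
    """
    out = frame.copy()
    value_cols = out.columns[out.columns.get_loc(marker) + 1:]
    for s in out.index[out.index >= 1]:
        # .loc slicing is label-based and inclusive on both ends,
        # so this selects the n rows labelled s-n through s-1.
        window = out.loc[s - n:s - 1, value_cols]
        out.loc[s, value_cols] = window.mean().to_numpy()
    return out


result = fill_with_trailing_mean(df1, n=11)
print(result)
```

Because earlier filled rows feed into later windows, the rows have to be processed in index order, which is why this sketch still loops over the rows instead of using a single rolling() call.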

Upvotes: 2

Rudresh Panchal

Reputation: 1000

You can do this by combining pandas and numpy. This assumes you know the position of the week column (even if you don't, a simple search will give you its index). For example, if week is the 3rd column, you can do something like

import numpy as np
import pandas as pd
# data is your dataframe name
column_list = list(data.columns.values)[3:]
for column_name in column_list:
    # work on a writable float copy of the column as a numpy array
    column = data[column_name].to_numpy(dtype=float, copy=True)
    for index in range(column.shape[0]):  # use xrange on Python 2
        # iterating over entries in the column
        if np.isnan(column[index]):
            column[index] = np.nanmean(column.take(range(index-10, index+1), mode='wrap'))
    # write the filled values back to the dataframe
    data[column_name] = column

This is a slow, unvectorized solution, but it should work fine. It replaces each NaN entry with the mean of the previous 10 entries, wrapping around to the end of the column when fewer than 10 entries precede it. If you want only the preceding entries without the wrap-around, take whatever is available when fewer than 10 exist, like
column[index] = np.nanmean(column[max(0, index-10):index+1])
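
As a rough, self-contained sketch of the non-wrapping variant on a plain array: the function name fill_nan_with_previous_mean and its window parameter are only illustrative and not part of numpy or pandas. It also leaves a NaN untouched when no earlier values exist, instead of averaging an empty slice.

```python
import numpy as np


def fill_nan_with_previous_mean(values, window=10):
    """Replace each NaN with the mean of up to `window` preceding entries.

    Hypothetical helper for illustration; no wrap-around is applied.
    """
    filled = np.asarray(values, dtype=float).copy()
    for i in range(filled.shape[0]):
        if np.isnan(filled[i]):
            previous = filled[max(0, i - window):i]
            # Skip the cell if there is nothing usable to average.
            if previous.size and not np.all(np.isnan(previous)):
                filled[i] = np.nanmean(previous)
    return filled


example = fill_nan_with_previous_mean([1.0, 2.0, 3.0, np.nan, np.nan], window=2)
print(example)
```

Values filled in earlier are reused for later cells, so in this example the second NaN is averaged from 3.0 and the value just computed for the first NaN.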

Hope this helps!

Upvotes: 1
