Joey
Joey

Reputation: 934

loop for computing average of selected data in dataframe using pandas

I have a 3 row x 96 column dataframe. I'm trying to computer the average of the two rows beneath the index (row1:96) for every 12 data points. here is my dataframe:

 Run 1       Run 2      Run 3       Run 4       Run 5       Run 6  \
0  1461274.92  1458079.44  1456807.1  1459216.08  1458643.24  1457145.19   
1   478167.44   479528.72  480316.08   475569.52   472989.01   476054.89   
2      ------      ------     ------      ------      ------      ------   

    Run 7       Run 8       Run 9      Run 10     ...          Run 87  \
0  1458117.08  1455184.82  1455768.69  1454738.07     ...      1441822.45   
1   473630.89   476282.93   475530.87   474200.22     ...        468525.2   
2      ------      ------      ------      ------     ...          ------   

   Run 88      Run 89      Run 90      Run 91      Run 92      Run 93  \
0  1445339.53  1461050.97  1446849.43  1438870.43  1431275.76  1430781.28   
1    460076.8   473263.06   455885.07   475245.64   483875.35   487065.25   
2      ------      ------      ------      ------      ------      ------   

   Run 94      Run 95      Run 96  
0  1436007.32  1435238.23  1444300.51  
1   474328.87   475789.12   458681.11  
2      ------      ------      ------  

[3 rows x 96 columns]

Currently I am trying to use df.irow(0) to select all the data in row index 0.

something along the lines of:

selection = np.arange(0,13)

for i in selection:
    new_df = pd.DataFrame()
    data = df.irow(0)

    ........

then i get lost

I just don't know how to link this range with the dataframe in order to computer the mean for every 12 data points in each column.

To summarize, I want the average for every 12 runs in each column. So, i should end up with a separate dataframe with 2 * 8 average values (96/12). any ideas?

thanks.

Upvotes: 0

Views: 3602

Answers (2)

DSM
DSM

Reputation: 353059

You can do a groupby on axis=1 (using some dummy data I made up):

>>> h = df.iloc[:2].astype(float)
>>> h.groupby(np.arange(len(h.columns))//12, axis=1).mean()
          0         1         2         3         4         5         6         7
0  0.609643  0.452047  0.536786  0.377845  0.544321  0.214615  0.541185  0.544462
1  0.382945  0.596034  0.659157  0.437576  0.490161  0.435382  0.476376  0.423039

First we extract the data and force recognition of a float (the presence of the ------ row means that you've probably got an object dtype, which will make the mean unhappy.)

Then we make an array saying what groups we want to put the different columns in:

>>> np.arange(len(df.columns))//12
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5,
       5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7], dtype=int32)

which we feed as an argument to groupby. .mean() handles the rest.

Upvotes: 2

holdenweb
holdenweb

Reputation: 37023

It's always best to try to use pandas methods when you can, rather than iterating over the rows. The DataFrame's iloc method is useful for extracting any number of rows.

The following example shows you how to do what you want in a two-column DataFrame. The same technique will work independent of the number of columns:

In [14]: df = pd.DataFrame({"x": [1, 2, "-"], "y": [3, 4, "-"]})

In [15]: df
Out[15]:
   x  y
0  1  3
1  2  4
2  -  -

In [16]: df.iloc[2] = df.iloc[0:2].sum()

In [17]: df
Out[17]:
   x  y
0  1  3
1  2  4
2  3  7

However, in your case you want to sum each group of eight cells in df.iloc[2]`, so you might be better simply taking the result of the summing expression with the statement

ds = df.iloc[0:2].sum()

which with your data will have the form

col1    0
col2    1
col3    2
col4    3
   ...
col93  92
col94  93
col95  94
col96  95

(These numbers are representative, you will obviously see your column sums). You can then turn this into a 12x8 matrix with

ds.values.reshape(12, 8)

whose value is

array([[ 0,  1,  2,  3,  4,  5,  6,  7],
   [ 8,  9, 10, 11, 12, 13, 14, 15],
   [16, 17, 18, 19, 20, 21, 22, 23],
   [24, 25, 26, 27, 28, 29, 30, 31],
   [32, 33, 34, 35, 36, 37, 38, 39],
   [40, 41, 42, 43, 44, 45, 46, 47],
   [48, 49, 50, 51, 52, 53, 54, 55],
   [56, 57, 58, 59, 60, 61, 62, 63],
   [64, 65, 66, 67, 68, 69, 70, 71],
   [72, 73, 74, 75, 76, 77, 78, 79],
   [80, 81, 82, 83, 84, 85, 86, 87],
   [88, 89, 90, 91, 92, 93, 94, 95]])

but summing this array will give you the sum of all elements, so instead create another DataFrame with

rs = pd.DataFrame(ds.values.reshape(12, 8))

and then sum that:

rs.sum()

giving

0    528
1    540
2    552
3    564
4    576
5    588
6    600
7    612
dtype: int64

You may find in practice that it is easier to simply create two 12x8 matrices in the first place, which you can add together before creating a dataframe which you can then sum. Much depends on how you are reading your data.

Upvotes: 1

Related Questions