Efficiently read in data in pandas dataframe and get average over a certain range

Question

I'm reading in quite a lot of data from some text files and save it to dataframes to easily be able to access the columns later when processing the data in a later stage. I feel the way I'm currently doing it is very inefficient and would like help with a better way of doing it. The text files are values for different variables at every time step of a simulation. The simulation are grouped by into categories and I want the average in every category. I don't want to take the average of every variable that is outputted but only the ones that I choose.

Some basics of the data. The number of files is known. The number of dataframes I want to average over is known (all files from a category). The columns are always the same in all files and thus the dataframes. The number of and name of categories are known.

Below is some simplified semi-pseudo code of what I'm currently doing. Note that it's more files and columns in the raw data.

sim_nr = np.arange(1,13)
categories = ['a','b','c']
variables = ['S', 'P', 'L']
multi_column = pd.MultiIndex.from_tuples(list(zip(categories, sim_nr)))

all_data = {}
for i in number_of_files:
    data = pd.read_csv('data_'+str(i)+'.txt')
    all_data[mutlicolumn[i]] = data

#for working example use this dictionary, 
all_data = {('a',1): pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[78,5,3,6], 'M':[1,2,3,4], 'L':[23,5,61,85]}),
            ('a',2): pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[78,5,3,6], 'M':[1,2,3,4], 'L':[32,73,30,41]}), 
            ('b',3):pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[2,3,4,5], 'M':[1,2,3,4], 'L':[32,73,30,41]}),
            ('b',4):pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[78,5,3,6], 'M':[1,2,3,4], 'L':[32,73,30,41]}),
            ('c',5):pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[11,20,55,8], 'M':[1,2,3,4], 'L':[23,5,61,85]}),
            ('c',6):pd.DataFrame({'S':[1,2,3,4], 'Q':[1,2,3,4], 'P':[78,5,3,6], 'M':[1,2,3,4], 'L':[32,73,30,41]})}

variableS = pd.DataFrame()
variableP = pd.DataFrame()
variableL = pd.DataFrame()

for cat in categories:
    test = pd.DataFrame()
    for key, data in all_data.items():
        if key[0] == cat:
            # print(key[0])
            test= pd.concat([test, data[variables]], axis= 1)
    variableS[cat] = test['S'].mean(axis=1)
    variableP[cat] = test['P'].mean(axis=1)
    variableL[cat] = test['L'].mean(axis=1)

I would like help with how I can improve on this. Especially the last loop with how to save the averaged values of the different variables.

mozway · Accepted Answer

You could first concat your datasets in a single DataFrame, then use groupby.mean to aggregate the data per index+second level in the tuple, then swap the first index level and columns with stack/unstack, finally split the groups with groupby and a dictionary comprehension:

categories = ['a', 'b', 'c']
variables = ['S', 'P', 'L']

# reshape and aggregate as mean
tmp = (pd.concat(all_data)
         .loc[categories, variables]
         .groupby(level=[0, 2]).mean()
         .stack().unstack(0)
       )

# split the variables
out = {k: v.droplevel(1) for k,v in tmp.groupby(level=1, sort=False)}
# or
# out = dict(list(tmp.droplevel(1)
#                    .groupby(tmp.index.get_level_values(1),
#                             sort=False)))

Output:

{'L':       a     b     c
      0  27.5  32.0  27.5
      1  39.0  73.0  39.0
      2  45.5  30.0  45.5
      3  63.0  41.0  63.0,
 'P':       a     b     c
      0  78.0  40.0  44.5
      1   5.0   4.0  12.5
      2   3.0   3.5  29.0
      3   6.0   5.5   7.0,
 'S':      a    b    c
      0  1.0  1.0  1.0
      1  2.0  2.0  2.0
      2  3.0  3.0  3.0
      3  4.0  4.0  4.0,
}

Intermediates:

# pd.concat(all_data).loc[categories, variables].groupby(level=[0, 2]).mean()

       S     P     L
a 0  1.0  78.0  27.5
  1  2.0   5.0  39.0
  2  3.0   3.0  45.5
  3  4.0   6.0  63.0
b 0  1.0  40.0  32.0
  1  2.0   4.0  73.0
  2  3.0   3.5  30.0
  3  4.0   5.5  41.0
c 0  1.0  44.5  27.5
  1  2.0  12.5  39.0
  2  3.0  29.0  45.5
  3  4.0   7.0  63.0

# .stack().unstack(0)

        a     b     c
0 S   1.0   1.0   1.0
  P  78.0  40.0  44.5
  L  27.5  32.0  27.5
1 S   2.0   2.0   2.0
  P   5.0   4.0  12.5
  L  39.0  73.0  39.0
2 S   3.0   3.0   3.0
  P   3.0   3.5  29.0
  L  45.5  30.0  45.5
3 S   4.0   4.0   4.0
  P   6.0   5.5   7.0
  L  63.0  41.0  63.0

Efficiently read in data in pandas dataframe and get average over a certain range

Answers (1)

Related Questions