Rojj
Rojj

Reputation: 1210

Pandas multiindex from series of dataframes

I have a series of dataframes with identical structure that represent results of a simulation for each hour of the year. Each simulation contains results for a series of coordinates (x,y).

Each dataframe is imported from a csv file that has time information only in the file name. Example:

results_YYMMDDHH.csv

contains data such

   x   y         a         b
 0.0 0.0  0.318705 -0.871259
 0.1 0.0 -0.937012  0.704270
 0.1 0.1 -0.032225 -1.939544
 0.0 0.1 -1.874781 -0.033073

I would like to create a single MultiIndexed Dataframe (level 0 is time and level 1 is (x,y)) that would allow me to perform various operations like averages, sums, max, etc. between these dataframes using the resampling or groupby methods. For each time step

The resulting dataframe should look something like this

                       x   y         a         b
2010-01-01 10:00     0.0 0.0  0.318705 -0.871259
                     0.1 0.0 -0.934512  0.745270
                     0.1 0.1 -0.0334525 -1.963544
                     0.0 0.1 -1.835781 -0.067573

2010-01-01 11:00     0.0 0.0  0.318705 -0.871259
                     0.1 0.0 -0.923012  0.745670
                     0.1 0.1 -0.035225 -1.963544
                     0.0 0.1 -1.835781 -0.067573
.................
.................
2010-12-01 10:00     0.0 0.0  0.318705 -0.871259
                     0.1 0.0 -0.923012  0.723270
                     0.1 0.1 -0.034225 -1.963234
                     0.0 0.1 -1.835781 -0.067233

You can imagine this for each hour of the year. I would like now to be able to calculate for example the average for the whole year or the average for June. Also any other function like the number of hours above a certain threshold or between a min and a max value. Please bear in mind that the result should be in any of these operations a DataFrame. For example the monthly averages should look like

              x   y     a     b
2010-01     0.0 0.0  0.45 -0.13
2010-02     0.1 0.0  0.55 -0.87
2010-03     0.1 0.1  0.24 -0.83
2010-04     0.0 0.1  0.11 -0.87

How do I build this MultiIndexed dataframe? I picture this like a timeseries of dataframes.

Upvotes: 2

Views: 1436

Answers (2)

Brian from QuantRocket
Brian from QuantRocket

Reputation: 5493

Here is a different answer from my earlier one, in light of the more fully explained question. Iterate through the files and read them into pandas, parse the date and add it to the dataframe, then use set_index to create your multiindex. Once you've got all your dataframes, use pd.concat to combine them:

dataframes = []
for filename in filenames:
    df = pd.read_csv(filename)
    df["datetime"] = datetime.datetime.strptime(filename[8:18], "%Y%m%d%H")
    dataframes.append(df.set_index(["datetime","x", "y"]))

combined_df = pd.concat(dataframes)

Upvotes: 1

Brian from QuantRocket
Brian from QuantRocket

Reputation: 5493

I would make a Panel then convert it into a multiindexed DataFrame using to_frame():

In [29]: df1 = pd.DataFrame(dict(a=[0.318705,-0.937012,-0.032225,-1.874781], b=[-0.871259,0.704270,-1.939544,-0.033073]))

In [30]: df2 = pd.DataFrame(dict(a=[0.318705,-0.937012,-0.032225,-1.874781], b=[-0.871259,0.704270,-1.939544,-0.033073]))

In [31]: df1
Out[31]:
          a         b
0  0.318705 -0.871259
1 -0.937012  0.704270
2 -0.032225 -1.939544
3 -1.874781 -0.033073

In [32]: data = {datetime.datetime(2010,6,21,10,0,0): df1, datetime.datetime(2010,6,22,10,0,0): df2}

In [33]: p = pd.Panel(data)

In [34]: p.to_frame()
Out[34]:
             2010-06-21 10:00:00  2010-06-22 10:00:00
major minor
0     a                 0.318705             0.318705
      b                -0.871259            -0.871259
1     a                -0.937012            -0.937012
      b                 0.704270             0.704270
2     a                -0.032225            -0.032225
      b                -1.939544            -1.939544
3     a                -1.874781            -1.874781
      b                -0.033073            -0.033073

Depending on how you want to look at your data, you can use swapaxes to rearrange it:

In [35]: p.swapaxes("major", "items").to_frame()
Out[35]:
                                  0         1         2         3
major               minor
2010-06-21 10:00:00 a      0.318705 -0.937012 -0.032225 -1.874781
                    b     -0.871259  0.704270 -1.939544 -0.033073
2010-06-22 10:00:00 a      0.318705 -0.937012 -0.032225 -1.874781
                    b     -0.871259  0.704270 -1.939544 -0.033073

Upvotes: 3

Related Questions