Reputation: 1210
I have a series of dataframes with identical structure that represent results of a simulation for each hour of the year. Each simulation contains results for a series of coordinates (x,y).
Each dataframe is imported from a csv file that has time information only in the file name. Example:
results_YYMMDDHH.csv
contains data such
x y a b
0.0 0.0 0.318705 -0.871259
0.1 0.0 -0.937012 0.704270
0.1 0.1 -0.032225 -1.939544
0.0 0.1 -1.874781 -0.033073
I would like to create a single MultiIndexed Dataframe (level 0 is time and level 1 is (x,y)) that would allow me to perform various operations like averages, sums, max, etc. between these dataframes using the resampling or groupby methods. For each time step
The resulting dataframe should look something like this
x y a b
2010-01-01 10:00 0.0 0.0 0.318705 -0.871259
0.1 0.0 -0.934512 0.745270
0.1 0.1 -0.0334525 -1.963544
0.0 0.1 -1.835781 -0.067573
2010-01-01 11:00 0.0 0.0 0.318705 -0.871259
0.1 0.0 -0.923012 0.745670
0.1 0.1 -0.035225 -1.963544
0.0 0.1 -1.835781 -0.067573
.................
.................
2010-12-01 10:00 0.0 0.0 0.318705 -0.871259
0.1 0.0 -0.923012 0.723270
0.1 0.1 -0.034225 -1.963234
0.0 0.1 -1.835781 -0.067233
You can imagine this for each hour of the year. I would like now to be able to calculate for example the average for the whole year or the average for June. Also any other function like the number of hours above a certain threshold or between a min and a max value. Please bear in mind that the result should be in any of these operations a DataFrame. For example the monthly averages should look like
x y a b
2010-01 0.0 0.0 0.45 -0.13
2010-02 0.1 0.0 0.55 -0.87
2010-03 0.1 0.1 0.24 -0.83
2010-04 0.0 0.1 0.11 -0.87
How do I build this MultiIndexed dataframe? I picture this like a timeseries of dataframes.
Upvotes: 2
Views: 1436
Reputation: 5493
Here is a different answer from my earlier one, in light of the more fully explained question. Iterate through the files and read them into pandas, parse the date and add it to the dataframe, then use set_index
to create your multiindex. Once you've got all your dataframes, use pd.concat
to combine them:
dataframes = []
for filename in filenames:
df = pd.read_csv(filename)
df["datetime"] = datetime.datetime.strptime(filename[8:18], "%Y%m%d%H")
dataframes.append(df.set_index(["datetime","x", "y"]))
combined_df = pd.concat(dataframes)
Upvotes: 1
Reputation: 5493
I would make a Panel then convert it into a multiindexed DataFrame using to_frame()
:
In [29]: df1 = pd.DataFrame(dict(a=[0.318705,-0.937012,-0.032225,-1.874781], b=[-0.871259,0.704270,-1.939544,-0.033073]))
In [30]: df2 = pd.DataFrame(dict(a=[0.318705,-0.937012,-0.032225,-1.874781], b=[-0.871259,0.704270,-1.939544,-0.033073]))
In [31]: df1
Out[31]:
a b
0 0.318705 -0.871259
1 -0.937012 0.704270
2 -0.032225 -1.939544
3 -1.874781 -0.033073
In [32]: data = {datetime.datetime(2010,6,21,10,0,0): df1, datetime.datetime(2010,6,22,10,0,0): df2}
In [33]: p = pd.Panel(data)
In [34]: p.to_frame()
Out[34]:
2010-06-21 10:00:00 2010-06-22 10:00:00
major minor
0 a 0.318705 0.318705
b -0.871259 -0.871259
1 a -0.937012 -0.937012
b 0.704270 0.704270
2 a -0.032225 -0.032225
b -1.939544 -1.939544
3 a -1.874781 -1.874781
b -0.033073 -0.033073
Depending on how you want to look at your data, you can use swapaxes
to rearrange it:
In [35]: p.swapaxes("major", "items").to_frame()
Out[35]:
0 1 2 3
major minor
2010-06-21 10:00:00 a 0.318705 -0.937012 -0.032225 -1.874781
b -0.871259 0.704270 -1.939544 -0.033073
2010-06-22 10:00:00 a 0.318705 -0.937012 -0.032225 -1.874781
b -0.871259 0.704270 -1.939544 -0.033073
Upvotes: 3