Converting multiple DataFrames into a Panel

Question

I have tons of excel files. Each of these files contains one or more variables for all subjects at a certain point in time. For each variable, I have, say, 10 files (storing the value of the variable at 10 different points in time). My ultimate goal is to set up a panel series.

Suppose there is only one variable in each file. For each variable (or item), I initialize an empty DataFrame item = pd.DataFrame() and successively read and append all 10 files into that empty DataFrame item = item.append(pd.DataFrame(df)), where df is from the new file. Each of those 10 DataFrames has dimension 1 x #subjects, thus I eventually have 10 x #subject. I turn this into a panel frame using pf = pd.Panel({'variable name': item}). Now, I can easily add this to a big panel frame with many other items...

Question: What is an easy and practical way to approach this problem if I have 2 or more variables in each file? If I stuck to the above approach, I would have a DataFrame of dimension #variables x #subjects for each file, leading to

                 subject1    subject2
variable1 2000        val         val
variable2 2000        val         val
variable1 2001        val         val
variable2 2001        val         val
...

after appending them. This is obviously a bad structure to convert this into panel data.

I could work myself around it - e.g. by appending to "the correct line" to keep the appropriate structure or reading the same file as many times as it has variables - but this would be cumbersome and/or costly. There have to be methods that do this work easily, but I couldn't find them in the docs.

Thanks for your help.

Alton Campbell · Accepted Answer

A Panel is essentially a stack of DataFrame objects, allowing the data to be explored in three dimensions. Thus, it does not matter how many variables or subjects are represented in each of your files, as long as each file represents only one point in time. Import each file into a DataFrame and then create your Panel.

This could be achieved by using a for loop over a list of your filenames. In your loop, you might check which year the data is from and store the results in a dictionary with all your other DataFrame objects, thus allowing you to easily convert your dictionary of dataframes into a panel.

If your original DataFrame format looks something like:

      Gerald  Kate
Var1       1     5
Var2       2     6
Var3       3     7
Var4       4     8

Then you can create your Panel with something like:

pn=pd.Panel(data={2010:df2010, 2015:df2015, 2020:df2020})

This yeilds a Panel with the properties:

Dimensions: 3 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: 2010 to 2020
Major_axis axis: Var1 to Var4
Minor_axis axis: Gerald to Kate

It is possible to slice by year:

print(pn[2015])


      Gerald  Kate
Var1       3    15
Var2       6    18
Var3       9    21
Var4      12    24

It is also possible to switch axes to get a better view of individual variables or subjects:

print(pn.transpose('minor_axis','major_axis','items')['Gerald'])

      2010  2015  2020
Var1     1     3     9
Var2     2     6    18
Var3     3     9    27
Var4     4    12    36

Converting multiple DataFrames into a Panel

Answers (1)

Related Questions