Reputation: 618
I have tons of excel files. Each of these files contains one or more variables for all subjects at a certain point in time. For each variable, I have, say, 10 files (storing the value of the variable at 10 different points in time). My ultimate goal is to set up a panel series.
Suppose there is only one variable in each file. For each variable (or item), I initialize an empty DataFrame item = pd.DataFrame()
and successively read and append all 10 files into that empty DataFrame item = item.append(pd.DataFrame(df))
, where df is from the new file. Each of those 10 DataFrames has dimension 1 x #subjects
, thus I eventually have 10 x #subject
. I turn this into a panel frame using pf = pd.Panel({'variable name': item})
. Now, I can easily add this to a big panel frame with many other items...
Question: What is an easy and practical way to approach this problem if I have 2 or more variables in each file? If I stuck to the above approach, I would have a DataFrame of dimension #variables x #subjects
for each file, leading to
subject1 subject2
variable1 2000 val val
variable2 2000 val val
variable1 2001 val val
variable2 2001 val val
...
after appending them. This is obviously a bad structure to convert this into panel data.
I could work myself around it - e.g. by appending to "the correct line" to keep the appropriate structure or reading the same file as many times as it has variables - but this would be cumbersome and/or costly. There have to be methods that do this work easily, but I couldn't find them in the docs.
Thanks for your help.
Upvotes: 0
Views: 1862
Reputation: 226
A Panel
is essentially a stack of DataFrame
objects, allowing the data to be explored in three dimensions. Thus, it does not matter how many variables or subjects are represented in each of your files, as long as each file represents only one point in time. Import each file into a DataFrame
and then create your Panel
.
This could be achieved by using a for loop over a list of your filenames. In your loop, you might check which year the data is from and store the results in a dictionary with all your other DataFrame
objects, thus allowing you to easily convert your dictionary of dataframes into a panel.
If your original DataFrame
format looks something like:
Gerald Kate
Var1 1 5
Var2 2 6
Var3 3 7
Var4 4 8
Then you can create your Panel
with something like:
pn=pd.Panel(data={2010:df2010, 2015:df2015, 2020:df2020})
This yeilds a Panel
with the properties:
Dimensions: 3 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: 2010 to 2020
Major_axis axis: Var1 to Var4
Minor_axis axis: Gerald to Kate
It is possible to slice by year:
print(pn[2015])
Gerald Kate
Var1 3 15
Var2 6 18
Var3 9 21
Var4 12 24
It is also possible to switch axes to get a better view of individual variables or subjects:
print(pn.transpose('minor_axis','major_axis','items')['Gerald'])
2010 2015 2020
Var1 1 3 9
Var2 2 6 18
Var3 3 9 27
Var4 4 12 36
Upvotes: 2