Reputation: 6616
I have an Excel sheet like this:
Roughly this as CSV:
,,,,,,,
,,,,,,,
,baseline,,,,scenario1,,
Year,A value,Another value,Etc.,Year,A value,Another value,Etc.
1900,4.74,7.08,0.00,1900,4.74,7.08,293.72
...
In reality there are more columns, but they always have the same names. There is also an unknown number of scenarios, all with the same column names. I'm trying to read this into Pandas. Right now I don't need to do any calculations (though this may change); I mostly just want a nice way to navigate the data.
So I've taken out the duplicate Year columns, as they will always be the same, and set Year as the index, which all works well, but for now I'm skipping the first header row (with baseline, scenario1, etc.). I want some way of dealing with each year while still being able to grab the value for each scenario, for example "I want the value of the Another value column for all scenarios for the Year 2013".
Later I might want to get averages (e.g. the average Another value across all scenarios) or do other similar calculations.
So far I have:
import pandas as pd

df = pd.read_excel(
    "sample_data.xlsx",
    index_col=0,
    skiprows=3,
    usecols=lambda x: not x.startswith("Year."),
)
Which gives me this:
A value Another value Etc. A value.1 Another value.1 Etc..1
Year
1900 4.738148 7.079923 0.000000 4.738148 7.079923 293.723450
1901 4.813111 7.179976 0.000000 4.813111 7.179976 294.029846
1902 4.881362 7.281056 0.000000 4.881362 7.281056 294.335663
1903 5.073858 7.503372 0.000000 5.073858 7.503372 294.586121
1904 5.104264 7.569910 0.000000 5.104264 7.569910 294.732697
... ... ... ... ... ... ...
2096 82.631828 109.509178 1.743965 -31.377947 -20.046871 337.782776
2097 82.983414 109.944275 1.749330 -31.621614 -20.278734 335.790863
2098 83.330070 110.372986 1.754694 -31.862324 -20.508474 333.807678
2099 83.671852 110.795380 1.760058 -32.102043 -20.738163 331.832764
2100 84.008766 111.211464 1.765422 -32.339836 -20.966856 329.865570
This is tolerable, but I'd really like to have them "grouped" somehow by the scenario name (baseline, scenario1, and any others). I suppose what I'd like to be able to do is something like df["2020"]["A value"] or the other way around, but I'm not sure how to do this, or if it's even the right approach. I feel like this would require a MultiIndex, but I'm not sure how to set that up with this data.
Here is the real structure, without data, saved to CSV:
,baseline,,,,,,,,,,,,,,,scenario1,,,,,,,,,,,,,
Time,"Global CO2 emissions
(GtonsCO2/year)","Global CO2 Equivalent Emissions
(GtonsCO2/year)","Atm conc CO2
(ppm)","Equivalent CO2
(ppm)","Temperature change from preindustrial
(Degrees C)",Sea Level Rise from 2000 (mm),"Global cumulative CO2
(GtonsCO2)","Global cumulative C
(GtonsC)","Budget of C remaining
(GtonsC)","Global CH4 anthro emissions
(Mtons/year)","Global N2O Anthro Emissions
(Mtons/year)","Global SF6 emissions
(tons/year)","Global PFC emissions
(tons/year)","Global CO2eq emissions from HFC
(GtonsCO2e/year)",Time,"Global CO2 Emissions
(GtonsCO2/year)","Global CO2 Equivalent Emissions
(GtonsCO2/year)","Atm conc CO2
(ppm)","Equivalent CO2
(ppm)","Temperature change from preindustrial
(Degrees C)",Sea Level Rise from 2000 (mm),"Global cumulative CO2
(GtonsCO2)","Global cumulative C
(GtonsC)","Budget of C remaining
(GtonsC)","Global CH4 anthro emissions
(Mtons/year)","Global N2O Anthro Emissions
(Mtons/year)","Global SF6 emissions
(tons/year)","Global PFC emissions
(tons/year)","Global CO2eq emissions from HFC
(GtonsCO2e/year)"
1900,,,,,,,,,,,,,,,1900,,,,,,,,,,,,,,
1901,,,,,,,,,,,,,,,1901,,,,,,,,,,,,,,
1902,,,,,,,,,,,,,,,1902,,,,,,,,,,,,,,
1903,,,,,,,,,,,,,,,1903,,,,,,,,,,,,,,
1904,,,,,,,,,,,,,,,1904,,,,,,,,,,,,,,
Annoyingly, there are newlines in the headers, but they do seem to get picked up correctly by Pandas.
Upvotes: 2
Views: 114
Reputation: 2049
This is possible, but a little tricky as a result of the non-rectangularity of your data. Pandas allows data to be read with multiple levels of columns, but your issue is that your first header level has cells missing (e.g. 'baseline' does not appear in cell C3). Pandas will fill these missing cells with 'Unnamed' columns, but won't recognize that those unnamed columns should be 'baseline'.
To mitigate this, we'll have to rename the column levels. Then we'll set the index to the year, and drop the Year column from all the scenarios.
First, we read the file without setting the index, and without skipping columns:
df = pd.read_excel("sample_data.xlsx", header=[2, 3])
Unnamed: 0_level_0 baseline Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 scenario1 Unnamed: 6_level_0 Unnamed: 7_level_0
Year A value Another value Etc. Year A value Another value Etc.
0 1900 4.74 7.08 0.0 1900 4.74 7.08 293.72
Then, we build lists of the two levels of the columns so that we can assign them later:
# Scenario names come from the first header level; ignore the 'Unnamed' fillers
scenarios = [col for col in df.columns.get_level_values(0) if not col.startswith('Unnamed')]

# Value names come from the second header level; keep one of each, preserving order
values = []
for col in df.columns.get_level_values(1):
    if col not in values:
        values.append(col)
This gives us MultiIndex column levels of:
scenarios
['baseline', 'scenario1']
values
['Year', 'A value', 'Another value', 'Etc.']
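(An equivalent, more compact way to build that second list is an order-preserving dict; this is just a sketch of the same deduplication, not a change to the approach:)
values = list(dict.fromkeys(df.columns.get_level_values(1)))  # dicts keep insertion order in Python 3.7+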
Now we assign the new MultiIndex columns using these level lists:
df.columns = pd.MultiIndex.from_product([scenarios, values], names=['scenario', 'value'])
scenario baseline scenario1
value Year A value Another value Etc. Year A value Another value Etc.
0 1900 4.74 7.08 0.0 1900 4.74 7.08 293.72
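At this point the grouping already helps: selecting a scenario by its top-level label returns just that block of columns, and a (scenario, value) tuple selects a single column. For example, with the frame as built so far:
df['baseline']                  # all columns belonging to the baseline scenario
df[('scenario1', 'A value')]    # a single column, returned as a Series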
Finally, we set the baseline Year as the index and drop Year from all the scenarios:
df = df.set_index(('baseline', 'Year'), drop=False)
df = df.drop('Year', axis=1, level=1)
scenario baseline scenario1
value A value Another value Etc. A value Another value Etc.
(baseline, Year)
1900 4.74 7.08 0.0 4.74 7.08 293.72
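With that in place, the look-ups from the question can be written against the column levels. Roughly (a sketch against the frame built above, assuming the year 2013 is present in the data; xs selects by a label on a named column level):
# 'Another value' for every scenario in the year 2013
df.xs('Another value', axis=1, level='value').loc[2013]

# Average 'Another value' across all scenarios, year by year
df.xs('Another value', axis=1, level='value').mean(axis=1)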
One final note: as a general rule, keep your data rectangular! If you do that, automated processing will be a lot easier.
EDIT: The issue you're having is that you have two columns which are supposed to be the same, but are spelled slightly differently:
Column 1:
Global CO2 emissions
(GtonsCO2/year)
Column 16:
Global CO2 Emissions
(GtonsCO2/year)
To handle cases like this, you'll want to select the second level of the columns in a different way:
df = pd.read_excel("sample_data.xlsx", header=[0, 1])
scenarios = [col for col in df.columns.get_level_values(0) if not col.startswith('Unnamed')]
# Instead of taking one of each unique column name, we take the first n columns,
# where n is the total number of columns divided by the number of scenarios
values = df.columns.get_level_values(1)[:len(df.columns) // len(scenarios)]
df.columns = pd.MultiIndex.from_product([scenarios, values], names=['scenario', 'value'])
df = df.set_index(('baseline', 'Time'), drop=False)
df = df.drop('Time', axis=1, level=1)
scenario baseline ... scenario1
value Global CO2 emissions\n(GtonsCO2/year) Global CO2 Equivalent Emissions\n(GtonsCO2/year) ... Global PFC emissions\n(tons/year) Global CO2eq emissions from HFC\n(GtonsCO2e/year)
(baseline, Time) ...
1900 NaN NaN ... NaN NaN
1901 NaN NaN ... NaN NaN
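An alternative to the positional slice is to normalise the second-level labels before deduplicating, so that 'emissions' vs 'Emissions' no longer matters. A sketch, assuming the labels only differ in case and stray whitespace (note this lower-cases the displayed column names, so the later set_index/drop calls would then use 'time' instead of 'Time'):
# Normalise the second header level, then deduplicate while preserving order
normalised = [col.strip().lower() for col in df.columns.get_level_values(1)]
values = list(dict.fromkeys(normalised))
df.columns = pd.MultiIndex.from_product([scenarios, values], names=['scenario', 'value'])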
Upvotes: 1