Reading Data with Groups of Columns using Pandas

Question

I have an excel sheet like this:

Roughly this as CSV:

,,,,,,,
,,,,,,,
,baseline,,,,scenario1,,
Year,A value,Another value,Etc.,Year,A value,Another value,Etc.
1900,4.74,7.08,0.00,1900,4.74,7.08,293.72
...

In reality there are more columns but they always have the same names. There are also a multiple but unknown number of scenarios, all with the same column names. I'm trying to read this in Pandas. Right now I don't need to do any calculations (though this may change). I mostly just want a nice way to navigate it.

So I've taken out the duplicate Year columns as they will always be the same, and set the Year as the index, which all works well, but for now I'm skipping the first header (with baseline, scenario1, etc.). I want some way of dealing with each year but being able to grab the value for each scenario out, for example "I want the value the Another value column for all scenarios for the Year 2013.

Later I might want to get averages (e.g. average Another value across all scenarios) or do other similar calculations.

So far I have:

df = pd.read_excel(
    "sample_data.xlsx",
    index_col=0,
    skiprows=3,
    usecols=lambda x: not x.startswith("Year."),
)

Which gives me this:

        A value  Another value      Etc.  A value.1  Another value.1      Etc..1
Year                                                                            
1900   4.738148       7.079923  0.000000   4.738148         7.079923  293.723450
1901   4.813111       7.179976  0.000000   4.813111         7.179976  294.029846
1902   4.881362       7.281056  0.000000   4.881362         7.281056  294.335663
1903   5.073858       7.503372  0.000000   5.073858         7.503372  294.586121
1904   5.104264       7.569910  0.000000   5.104264         7.569910  294.732697
...         ...            ...       ...        ...              ...         ...
2096  82.631828     109.509178  1.743965 -31.377947       -20.046871  337.782776
2097  82.983414     109.944275  1.749330 -31.621614       -20.278734  335.790863
2098  83.330070     110.372986  1.754694 -31.862324       -20.508474  333.807678
2099  83.671852     110.795380  1.760058 -32.102043       -20.738163  331.832764
2100  84.008766     111.211464  1.765422 -32.339836       -20.966856  329.865570

This is tolerable but I'd really like to have them "grouped" somehow by the scenario name (baseline, scenario1, any any others). I suppose what I'd like to be able to do is something like df["2020"]["A value"] or the other way around, but I'm not sure how to do this, or if it's even the right approach. I feel like this would require a MultiIndex, but I'm not sure how to set that up with this data.

Here is the real structure, without data, saved to CSV:

,baseline,,,,,,,,,,,,,,,scenario1,,,,,,,,,,,,,
Time,"Global CO2 emissions
(GtonsCO2/year)","Global CO2 Equivalent Emissions
(GtonsCO2/year)","Atm conc CO2
(ppm)","Equivalent CO2
(ppm)","Temperature change from preindustrial
(Degrees C)",Sea Level Rise from 2000 (mm),"Global cumulative CO2
(GtonsCO2)","Global cumulative C
(GtonsC)","Budget of C remaining
(GtonsC)","Global CH4 anthro emissions
(Mtons/year)","Global N2O Anthro Emissions
(Mtons/year)","Global SF6 emissions
(tons/year)","Global PFC emissions
(tons/year)","Global CO2eq emissions from HFC
(GtonsCO2e/year)",Time,"Global CO2 Emissions
(GtonsCO2/year)","Global CO2 Equivalent Emissions
(GtonsCO2/year)","Atm conc CO2
(ppm)","Equivalent CO2
(ppm)","Temperature change from preindustrial
(Degrees C)",Sea Level Rise from 2000 (mm),"Global cumulative CO2
(GtonsCO2)","Global cumulative C
(GtonsC)","Budget of C remaining
(GtonsC)","Global CH4 anthro emissions
(Mtons/year)","Global N2O Anthro Emissions
(Mtons/year)","Global SF6 emissions
(tons/year)","Global PFC emissions
(tons/year)","Global CO2eq emissions from HFC
(GtonsCO2e/year)"
1900,,,,,,,,,,,,,,,1900,,,,,,,,,,,,,,
1901,,,,,,,,,,,,,,,1901,,,,,,,,,,,,,,
1902,,,,,,,,,,,,,,,1902,,,,,,,,,,,,,,
1903,,,,,,,,,,,,,,,1903,,,,,,,,,,,,,,
1904,,,,,,,,,,,,,,,1904,,,,,,,,,,,,,,

There are annoyingly newlines in the headers, but they do seem to get picked up correctly in Pandas.

Dave · Accepted Answer

This is possible, but a little tricky as a result of the non-rectangularity of your data. Pandas allows data to be read with multiple levels of columns, but your issue is that your first level of header has cells missing (i.e. 'baseline' does not appear in cell C3.) Pandas will fill these missing cells with an 'Unnamed' column, but won't recognize that these unnamed columns should be 'baseline'.

In order to mitigate this, we'll have to rename the columns levels. Then we'll set the index to be year, and drop this column from all the scenarios:

First, we read the file without setting the index, and without skipping columns:

df = pd.read_excel("sample_data.xlsx", header=[2, 3])

  Unnamed: 0_level_0 baseline Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 scenario1 Unnamed: 6_level_0 Unnamed: 7_level_0
                Year  A value      Another value               Etc.               Year   A value      Another value               Etc.
0               1900     4.74               7.08                0.0               1900      4.74               7.08             293.72

Then, we build lists of the two levels of the columns so that we can assign them later:

scenarios = [col for col in df.columns.get_level_values(0) if not col.startswith('Unnamed')]
values = []
for col in df.columns.get_level_values(1):
    if col not in values:
        values.append(col)

This gives us multiindex column levels of:

scenarios
['baseline', 'scenario1']

values                                                                                                                                                                                              

['Year', 'A value', 'Another value', 'Etc.']

Now we assign the new multiindex columns using these level lists:

df.columns = df.columns.from_product([scenarios, values], names=['scenario','value'])

scenario baseline                            scenario1                              
value        Year A value Another value Etc.      Year A value Another value    Etc.
0            1900    4.74          7.08  0.0      1900    4.74          7.08  293.72

Finally, we assign the baseline year as the index and drop year from all the scenarios:

df = df.set_index(('baseline', 'Year'), drop=False)
df = df.drop('Year', axis=1, level=1)

scenario         baseline                    scenario1                      
value             A value Another value Etc.   A value Another value    Etc.
(baseline, Year)                                                            
1900                 4.74          7.08  0.0      4.74          7.08  293.72

One final note: as a general rule, keep your data rectangular! If you do that, automated processing will be a lot easier.

EDIT: The issue you're having is that you have two columns which are supposed to be the same, but are spelled slightly differently:

Column 1:

Global CO2 emissions
(GtonsCO2/year)

Column 16:

Global CO2 Emissions
(GtonsCO2/year)

To handle cases like this, you'll want to select the second level of the columns in a different way:

df = pd.read_excel("sample_data.xlsx", header=[0, 1])

scenarios = [col for col in df.columns.get_level_values(0) if not col.startswith('Unnamed')]

# Instead of taking one of each unique column names, we take the first n columns,
# where n is the total number of columns divided by the number of scenarios
values = df.columns.get_level_values(1)[:int(len(df.columns) / len(scenarios))]

df.columns = df.columns.from_product([scenarios, values], names=['scenario','value'])
df = df.set_index(('baseline', 'Time'), drop=False)
df = df.drop('Time', axis=1, level=1)

scenario                                      baseline                                                   ...                         scenario1                                                  
value            Global CO2 emissions
(GtonsCO2/year) Global CO2 Equivalent Emissions
(GtonsCO2/year)  ... Global PFC emissions
(tons/year) Global CO2eq emissions from HFC
(GtonsCO2e/year)
(baseline, Time)                                                                                         ...                                                                                    
1900                                               NaN                                              NaN  ...                               NaN                                               NaN
1901                                               NaN                                              NaN  ...                               NaN                                               NaN

Reading Data with Groups of Columns using Pandas

Answers (1)

Related Questions