otteheng
otteheng

Reputation: 604

Multiple columns as the index for each level in a MultiIndex dataframe

I'm having trouble describing the way my data is structured, hopefully this image will help. enter image description here

As you can see I have 3 columns: Name, Corp #, and Dissagregation Category, that together serve as a unique identifier for each row. There are two super headers, "English/Language Arts Scores", and "Mathematics Scores." I want to create two separate dataframes for each header. This is what I've managed so far:

df1 = pd.read_excel(file, header=None)
vals = df1.values
mux = pd.MultiIndex.from_arrays(df1.ffill(1).values[:2, 1:], names=[None, 'Name'])
df = pd.DataFrame(df1.values[2:, 1:], df1.values[2:, 0], mux)

This generates a dataframe with two levels: df['English/Language Arts Scores'] and df['Mathematics Scores'], that contain the columns under the header and "Name" as the index. I want these dataframes to include: Name, Copr #, and Dissagregation Category as columns or as the index.

What would be the best way to go about doing this?

EDIT

Here is a copy and pasteable snippet of my data:

                                          English        Math   
         A        B          C           X       Y     X       Y
 ADAMS CENTRAL  0015    All Students    83      590    83     579
 ADAMS CENTRAL  0015    General 1       0       ***    0      ***
 ADAMS CENTRAL  0015    Total Gene      71      590    71     579
 ADAMS West     0016    All Students    93      440    83     765
 ADAMS West     0016    General 1       1        33    0      660
 ADAMS West     0016    Total Gene      31      ***    46     572

The output that I want should look like this:

In [1]: df['English']
Out[1]:                                              
         A        B          C           X       Y     
 ADAMS CENTRAL  0015    All Students    83      590    
 ADAMS CENTRAL  0015    General 1       0       ***    
 ADAMS CENTRAL  0015    Total Gene      71      590    
 ADAMS West     0016    All Students    93      440    
 ADAMS West     0016    General 1       1        33    
 ADAMS West     0016    Total Gene      31      ***  

In [2]: df['Math']
Out[2]:                                                 
         A        B          C           X       Y
 ADAMS CENTRAL  0015    All Students    83      579
 ADAMS CENTRAL  0015    General 1       0       ***
 ADAMS CENTRAL  0015    Total Gene      71      579
 ADAMS West     0016    All Students    83      765
 ADAMS West     0016    General 1       0       660
 ADAMS West     0016    Total Gene      46      572

Upvotes: 1

Views: 2098

Answers (1)

BENY
BENY

Reputation: 323226

We can using read_excel, read the file, then using .loc, slice the columns level0 you need

df = pd.read_excel('yourexcel.xlsx',
                   header=[0,1],
                   index_col=[0,1,2],
                   sheetname="Sheet1")

df.loc[:,'English']
Out[837]: 
C                               X    Y
ADAMS CENTRAL 15 All Students  83  590
                 General 1      0  ***
                 Total Gene    71  590
ADAMS West    16 All Students  93  440
                 General 1      1   33
                 Total Gene    31  ***

To make it more neat add rename_axis

df.loc[:,'English'].rename_axis(None,1)
Out[840]: 
                                X    Y
ADAMS CENTRAL 15 All Students  83  590
                 General 1      0  ***
                 Total Gene    71  590
ADAMS West    16 All Students  93  440
                 General 1      1   33
                 Total Gene    31  ***

Our base on your method

vals = df.iloc[3:,:].values
df1 = pd.DataFrame(df.values[3:, 3:], df.values[3:, 0:3])
mux = pd.MultiIndex.from_arrays(df.ffill().ffill(1).values[1:3, 3:])
df1.columns=mux
df1.index = pd.MultiIndex.from_tuples(df1.index)

Upvotes: 1

Related Questions