Reputation: 604
I'm having trouble describing the way my data is structured, hopefully this image will help.
As you can see I have 3 columns: Name, Corp #, and Dissagregation Category, that together serve as a unique identifier for each row. There are two super headers, "English/Language Arts Scores", and "Mathematics Scores." I want to create two separate dataframes for each header. This is what I've managed so far:
df1 = pd.read_excel(file, header=None)
vals = df1.values
mux = pd.MultiIndex.from_arrays(df1.ffill(1).values[:2, 1:], names=[None, 'Name'])
df = pd.DataFrame(df1.values[2:, 1:], df1.values[2:, 0], mux)
This generates a dataframe with two levels: df['English/Language Arts Scores']
and df['Mathematics Scores']
, that contain the columns under the header and "Name" as the index. I want these dataframes to include: Name, Copr #, and Dissagregation Category as columns or as the index.
What would be the best way to go about doing this?
EDIT
Here is a copy and pasteable snippet of my data:
English Math
A B C X Y X Y
ADAMS CENTRAL 0015 All Students 83 590 83 579
ADAMS CENTRAL 0015 General 1 0 *** 0 ***
ADAMS CENTRAL 0015 Total Gene 71 590 71 579
ADAMS West 0016 All Students 93 440 83 765
ADAMS West 0016 General 1 1 33 0 660
ADAMS West 0016 Total Gene 31 *** 46 572
The output that I want should look like this:
In [1]: df['English']
Out[1]:
A B C X Y
ADAMS CENTRAL 0015 All Students 83 590
ADAMS CENTRAL 0015 General 1 0 ***
ADAMS CENTRAL 0015 Total Gene 71 590
ADAMS West 0016 All Students 93 440
ADAMS West 0016 General 1 1 33
ADAMS West 0016 Total Gene 31 ***
In [2]: df['Math']
Out[2]:
A B C X Y
ADAMS CENTRAL 0015 All Students 83 579
ADAMS CENTRAL 0015 General 1 0 ***
ADAMS CENTRAL 0015 Total Gene 71 579
ADAMS West 0016 All Students 83 765
ADAMS West 0016 General 1 0 660
ADAMS West 0016 Total Gene 46 572
Upvotes: 1
Views: 2098
Reputation: 323226
We can using read_excel
, read the file, then using .loc
, slice the columns level0 you need
df = pd.read_excel('yourexcel.xlsx',
header=[0,1],
index_col=[0,1,2],
sheetname="Sheet1")
df.loc[:,'English']
Out[837]:
C X Y
ADAMS CENTRAL 15 All Students 83 590
General 1 0 ***
Total Gene 71 590
ADAMS West 16 All Students 93 440
General 1 1 33
Total Gene 31 ***
To make it more neat add rename_axis
df.loc[:,'English'].rename_axis(None,1)
Out[840]:
X Y
ADAMS CENTRAL 15 All Students 83 590
General 1 0 ***
Total Gene 71 590
ADAMS West 16 All Students 93 440
General 1 1 33
Total Gene 31 ***
Our base on your method
vals = df.iloc[3:,:].values
df1 = pd.DataFrame(df.values[3:, 3:], df.values[3:, 0:3])
mux = pd.MultiIndex.from_arrays(df.ffill().ffill(1).values[1:3, 3:])
df1.columns=mux
df1.index = pd.MultiIndex.from_tuples(df1.index)
Upvotes: 1