Reputation: 2987
I am trying to read an Excel file using pandas but I am not sure if I am able to read the way I need.
My file is like this:
I am reading the file like this:
excel_file = pd.ExcelFile('MY_FILE')
df = excel_file.parse(sheet_name=0, header=1)
This way I am able to read but I am not sure from which group each variable belongs. In this case, for each column I need to know from which group they are coming from. Is there any way to do this?
Thank you!
Upvotes: 3
Views: 306
Reputation: 863481
Here is possible specify first and second row in parameter header
for MultiIndex
in columns and index_col
for index from first column in function read_excel
:
df = pd.read_excel('file.xlsx', header=[0,1], index_col=[0], sheet_name=0)
Your solution should be changed with same parameters:
excel_file = pd.ExcelFile('file.xlsx')
df = excel_file.parse(header=[0,1], index_col=[0], sheet_name=0)
print (df)
CUSTOM NAME g1 g2
NAME A B A B C
NAME 1 1.0 NaN 1 NaN 1.0
NAME 1 NaN 1.0 1 1.0 NaN
print (df.columns)
MultiIndex(levels=[['g1', 'g2'], ['A', 'B', 'C']],
codes=[[0, 0, 1, 1, 1], [0, 1, 0, 1, 2]],
names=['CUSTOM NAME', 'NAME'])
print (df.index)
Index(['NAME 1', 'NAME 1'], dtype='object')
Filtering working with tuples for select columns of MultiIndex
:
print (df[df[('g1', 'A')] == 1])
CUSTOM NAME g1 g2
NAME A B A B C
NAME 1 1.0 NaN 1 NaN 1.0
More information in Select rows in pandas MultiIndex DataFrame
, only remove loc
, because MultiIndex
in columns.
Upvotes: 2
Reputation: 31011
Reading column names only from row 1 will probably lead to repeating column names, so such a call can end up with execution error.
It looks like you have a case of MultiIndex on columns:
When reading Excel files, I usually use pd.read_excel method.
In your case, you probably should pass header=[0,1] parameter, specifying which rows contain column names.
So try something like:
df = pd.read_excel('MY_FILE.xls', header=[0,1])
You can also omit sheet_name=0, as this is the default value (read from the first sheet, sheet numbering is from 0).
I'm also not sure whether you can omit file extension (either xls or xlsx), so in the above proposal I wrote it explicitely.
Upvotes: 1
Reputation: 2600
You can use MultiIndex
if you pass a list of integers to header
:
excel_file = pd.ExcelFile('example.xlsx')
df = excel_file.parse(sheet_name=0, header=[0,1])
the dataframe:
CUSTOM NAME GROUP 1 GROUP 2
NAME A B A B C
NAME 1 1.0 NaN 1 NaN 1.0
NAME 2 NaN 1.0 1 1.0 NaN
Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
Upvotes: 2