briba
briba

Reputation: 2987

Reading excel files in pandas

I am trying to read an Excel file using pandas but I am not sure if I am able to read the way I need.

My file is like this:

enter image description here

I am reading the file like this:

excel_file = pd.ExcelFile('MY_FILE')
df = excel_file.parse(sheet_name=0, header=1)

This way I am able to read but I am not sure from which group each variable belongs. In this case, for each column I need to know from which group they are coming from. Is there any way to do this?

Thank you!

Upvotes: 3

Views: 306

Answers (3)

jezrael
jezrael

Reputation: 863481

Here is possible specify first and second row in parameter header for MultiIndex in columns and index_col for index from first column in function read_excel:

df = pd.read_excel('file.xlsx', header=[0,1], index_col=[0], sheet_name=0)

Your solution should be changed with same parameters:

excel_file = pd.ExcelFile('file.xlsx')
df = excel_file.parse(header=[0,1], index_col=[0], sheet_name=0)

print (df)
CUSTOM NAME   g1      g2          
NAME           A    B  A    B    C
NAME 1       1.0  NaN  1  NaN  1.0
NAME 1       NaN  1.0  1  1.0  NaN

print (df.columns)
MultiIndex(levels=[['g1', 'g2'], ['A', 'B', 'C']],
           codes=[[0, 0, 1, 1, 1], [0, 1, 0, 1, 2]],
           names=['CUSTOM NAME', 'NAME'])

print (df.index)
Index(['NAME 1', 'NAME 1'], dtype='object')

Filtering working with tuples for select columns of MultiIndex:

print (df[df[('g1', 'A')] == 1])
CUSTOM NAME   g1     g2         
NAME           A   B  A   B    C
NAME 1       1.0 NaN  1 NaN  1.0

More information in Select rows in pandas MultiIndex DataFrame, only remove loc, because MultiIndex in columns.

Upvotes: 2

Valdi_Bo
Valdi_Bo

Reputation: 31011

Reading column names only from row 1 will probably lead to repeating column names, so such a call can end up with execution error.

It looks like you have a case of MultiIndex on columns:

  • GROUP 1 and GROUP 2 is the top level,
  • A, B and C is the subordinate level.

When reading Excel files, I usually use pd.read_excel method.

In your case, you probably should pass header=[0,1] parameter, specifying which rows contain column names.

So try something like:

df = pd.read_excel('MY_FILE.xls', header=[0,1])

You can also omit sheet_name=0, as this is the default value (read from the first sheet, sheet numbering is from 0).

I'm also not sure whether you can omit file extension (either xls or xlsx), so in the above proposal I wrote it explicitely.

Upvotes: 1

PythonSherpa
PythonSherpa

Reputation: 2600

You can use MultiIndex if you pass a list of integers to header:

excel_file = pd.ExcelFile('example.xlsx')
df = excel_file.parse(sheet_name=0, header=[0,1])

the dataframe:

CUSTOM NAME     GROUP 1     GROUP 2
NAME            A   B       A   B   C
NAME 1          1.0 NaN     1   NaN 1.0
NAME 2          NaN 1.0     1   1.0 NaN

Documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

Upvotes: 2

Related Questions