How to process excel file headers using pandas/python

Question

I am trying to read https://www.whatdotheyknow.com/request/193811/response/480664/attach/3/GCSE%20IGCSE%20results%20v3.xlsx using pandas.

Having saved it my script is

import sys
import pandas as pd
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
#    print xl.sheet_names
df = xl.parse(xl.sheet_names[0])
print df.head()

However this does not seem to process the headers properly as it gives

  GCSE and IGCSE1 results2,3 in selected subjects4 of pupils at the end of key stage 4 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0                              Year: 2010/11 (Final)                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
1                                  Coverage: England                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
2                                                NaN                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
3  1. Includes International GCSE, Cambridge Inte...                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
4  2. Includes attempts and achievements by these...                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN

All of this should be treated as comments.

If you load the spreadsheet into libreoffice, for example, you can see that the column headings are correctly parsed and appear in row 15 with drop down menus to let you select the items you want.

How can you get pandas to automatically detect where the column headers are just as libreoffice does?

DSM · Accepted Answer

pandas is (are?) processing the file correctly, and exactly the way you asked it (them?) to. You didn't specify a header value, which means that it defaults to picking up the column names from the 0th row. The first few rows of cells aren't comments in some fundamental way, they're just not cells you're interested in.

Simply tell parse you want to skip some rows:

>>> xl = pd.ExcelFile("GCSE IGCSE results v3.xlsx")
>>> df = xl.parse(xl.sheet_names[0], skiprows=14)
>>> df.columns
Index([u'Local Authority Number', u'Local Authority Name', u'Local Authority Establishment Number', u'Unique Reference Number', u'School Name', u'Town', u'Number of pupils at the end of key stage 4', u'Number of pupils attempting a GCSE or an IGCSE', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-G', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-A', u'Number of students achieving 5 A*-A grades or more at GCSE or IGCSE'], dtype='object')
>>> df.head()
   Local Authority Number Local Authority Name  \
0                     201       City of london   
1                     201       City of london   
2                     202               Camden   
3                     202               Camden   
4                     202               Camden   

   Local Authority Establishment Number  Unique Reference Number  \
0                               2016005                   100001   
1                               2016007                   100003   
2                               2024104                   100049   
3                               2024166                   100050   
4                               2024196                   100051   

                       School Name    Town  \
0  City of London School for Girls  London   
1            City of London School  London   
2                Haverstock School  London   
3           Parliament Hill School  London   
4               Regent High School  London   

  Number of pupils at the end of key stage 4  \
0                                        105   
1                                        140   
2                                        200   
3                                        172   
4                                        174   

  Number of pupils attempting a GCSE or an IGCSE  \
0                                            104   
1                                            140   
2                                            194   
3                                            169   
4                                            171   

  Number of students achieving 8 or more GCSE or IGCSE passes at A*-G  \
0                                                100                    
1                                                108                    
2                                               SUPP                    
3                                                 22                    
4                                                  0                    

  Number of students achieving 8 or more GCSE or IGCSE passes at A*-A  \
0                                                 87                    
1                                                 75                    
2                                                  0                    
3                                                  7                    
4                                                  0                    

  Number of students achieving 5 A*-A grades or more at GCSE or IGCSE  
0                                                100                   
1                                                123                   
2                                                  0                   
3                                                 34                   
4                                               SUPP                    

[5 rows x 11 columns]

How to process excel file headers using pandas/python

Answers (1)

Related Questions