How to load a text file of data with many commented rows, into pandas?

Question

I am trying to read a deliminated text file into a dataframe in python. The deliminator is not being identified when I use pd.read_table. If I explicitly set sep = ' ', I get an error: Error tokenizing data. C error. Notably the defaults work when I use np.loadtxt().

Example:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment = '%',
              header = None)
    0
0   1850 1 -0.777 0.412 NaN NaN...
1   1850 2 -0.239 0.458 NaN NaN...
2   1850 3 -0.426 0.447 NaN NaN...
3   1850 4 -0.680 0.367 NaN NaN...
4   1850 5 -0.687 0.298 NaN NaN...

If I set sep = ' ', I get another error:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment = '%',
              header = None,
              sep = ' ')

ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58

Looking up this error, people suggest using header = None (already done) and setting sep = explicitly, but that is causing the problem: Python Pandas Error tokenizing data. I looked up line 78 and can't see any problems. If I set error_bad_lines=False i get an empty df suggesting there is a problem with every entry.

Notably this works when I use np.loadtxt():

pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', 
                        comments = '%'))
    0   1   2   3   4   5   6   7   8   9   10  11
0   1850.0  1.0     -0.777  0.412   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1   1850.0  2.0     -0.239  0.458   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   1850.0  3.0     -0.426  0.447   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
3   1850.0  4.0     -0.680  0.367   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
4   1850.0  5.0     -0.687  0.298   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

This suggests to me that there isn't something wrong with the file, but rather with how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() in the hope of setting the sep to the same value, but that just shows: delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).

I'd prefer to be able to import this as a pd.DataFrame, setting the names, rather than having to import as a matrix and then convert to pd.DataFrame.

What am I getting wrong?

tags · Accepted Answer

This one is quite tricky. Please try out the snippet code below:

import pandas as pd
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
df = pd.read_csv(url,
                 sep='\s+',
                 comment='%',
                 usecols=(0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11),
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',                                           
                        'A.Unc.','5y.Anomaly', '5y.Unc.' ,'10y.Anomaly', '10y.Unc.',
                        '20y.Anomaly', '20y.Unc.'))

How to load a text file of data with many commented rows, into pandas?

Answers (2)

Without the header code

`air`

`h2o`

Related Questions