MorrisseyJ
MorrisseyJ

Reputation: 1271

How to load a text file of data with many commented rows, into pandas?

I am trying to read a deliminated text file into a dataframe in python. The deliminator is not being identified when I use pd.read_table. If I explicitly set sep = ' ', I get an error: Error tokenizing data. C error. Notably the defaults work when I use np.loadtxt().

Example:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment = '%',
              header = None)
    0
0   1850 1 -0.777 0.412 NaN NaN...
1   1850 2 -0.239 0.458 NaN NaN...
2   1850 3 -0.426 0.447 NaN NaN...
3   1850 4 -0.680 0.367 NaN NaN...
4   1850 5 -0.687 0.298 NaN NaN...

If I set sep = ' ', I get another error:

pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
              comment = '%',
              header = None,
              sep = ' ')

ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58

Looking up this error, people suggest using header = None (already done) and setting sep = explicitly, but that is causing the problem: Python Pandas Error tokenizing data. I looked up line 78 and can't see any problems. If I set error_bad_lines=False i get an empty df suggesting there is a problem with every entry.

Notably this works when I use np.loadtxt():

pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt', 
                        comments = '%'))
    0   1   2   3   4   5   6   7   8   9   10  11
0   1850.0  1.0     -0.777  0.412   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
1   1850.0  2.0     -0.239  0.458   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   1850.0  3.0     -0.426  0.447   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
3   1850.0  4.0     -0.680  0.367   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
4   1850.0  5.0     -0.687  0.298   NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN

This suggests to me that there isn't something wrong with the file, but rather with how I am calling pd.read_table(). I looked through the documentation for np.loadtxt() in the hope of setting the sep to the same value, but that just shows: delimiter=None (https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).

I'd prefer to be able to import this as a pd.DataFrame, setting the names, rather than having to import as a matrix and then convert to pd.DataFrame.

What am I getting wrong?

Upvotes: 1

Views: 529

Answers (2)

Trenton McKinney
Trenton McKinney

Reputation: 62413

  • The issue is the file has 77 rows of commented text, for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'
    • Two of the rows are headers
  • There's a bunch of data, then there are two more headers, and a new set of data for 'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'
  • This solution separates the two tables in the file into separate dataframes.
  • This is not as nice as the other answer, but the data is properly separated into different dataframes.
  • The headers were a pain, it would probably be easier to manually create a custom header, and skip the lines of code for separating the headers from the text.
  • The important point separating air and ice data.
import requests
import pandas as pd
import math

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# specify the data from the ranges in the file
air_header1 = data[74].split()  # not used
air_header2 = [v.strip() for v in data[75].split(',')]

# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]

h2o_header1 = data[2129].split()  # not used
h2o_header2 = [v.strip() for v in data[2130].split(',')]

# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)

Without the header code

  • Simplify the code, by using a manual header list.
import pandas as pd
import requests

# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text

# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]

# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
           'Annual_Anomaly', 'Annual_Unc.',
           'Five-year_Anomaly', 'Five-year_Unc.',
           'Ten-year_Anomaly', 'Ten-year_Unc.',
           'Twenty-year_Anomaly', 'Twenty-year_Unc.']

# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]

# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)

air

   Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0  1850     1          -0.777        0.412            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN
1  1850     2          -0.239        0.458            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN
2  1850     3          -0.426        0.447            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN

h2o

   Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0  1850     1          -0.724        0.370            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN
1  1850     2          -0.221        0.430            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN
2  1850     3          -0.443        0.419            NaN         NaN               NaN            NaN              NaN           NaN                 NaN              NaN

Upvotes: 2

tags
tags

Reputation: 4060

This one is quite tricky. Please try out the snippet code below:

import pandas as pd
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
df = pd.read_csv(url,
                 sep='\s+',
                 comment='%',
                 usecols=(0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11),
                 names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',                                           
                        'A.Unc.','5y.Anomaly', '5y.Unc.' ,'10y.Anomaly', '10y.Unc.',
                        '20y.Anomaly', '20y.Unc.'))

Upvotes: 2

Related Questions