Reputation: 1271
I am trying to read a deliminated text file into a dataframe in python. The deliminator is not being identified when I use pd.read_table
. If I explicitly set sep = ' '
, I get an error: Error tokenizing data. C error
. Notably the defaults work when I use np.loadtxt()
.
Example:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comment = '%',
header = None)
0
0 1850 1 -0.777 0.412 NaN NaN...
1 1850 2 -0.239 0.458 NaN NaN...
2 1850 3 -0.426 0.447 NaN NaN...
3 1850 4 -0.680 0.367 NaN NaN...
4 1850 5 -0.687 0.298 NaN NaN...
If I set sep = ' ', I get another error:
pd.read_table('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comment = '%',
header = None,
sep = ' ')
ParserError: Error tokenizing data. C error: Expected 2 fields in line 78, saw 58
Looking up this error, people suggest using header = None
(already done) and setting sep =
explicitly, but that is causing the problem: Python Pandas Error tokenizing data. I looked up line 78 and can't see any problems. If I set error_bad_lines=False
i get an empty df suggesting there is a problem with every entry.
Notably this works when I use np.loadtxt()
:
pd.DataFrame(np.loadtxt('http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt',
comments = '%'))
0 1 2 3 4 5 6 7 8 9 10 11
0 1850.0 1.0 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850.0 2.0 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850.0 3.0 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
3 1850.0 4.0 -0.680 0.367 NaN NaN NaN NaN NaN NaN NaN NaN
4 1850.0 5.0 -0.687 0.298 NaN NaN NaN NaN NaN NaN NaN NaN
This suggests to me that there isn't something wrong with the file, but rather with how I am calling pd.read_table()
. I looked through the documentation for np.loadtxt()
in the hope of setting the sep to the same value, but that just shows: delimiter=None
(https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html).
I'd prefer to be able to import this as a pd.DataFrame
, setting the names, rather than having to import as a matrix
and then convert to pd.DataFrame
.
What am I getting wrong?
Upvotes: 1
Views: 529
Reputation: 62413
'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Air Temperatures'
'Global Average Temperature Anomaly with Sea Ice Temperature Inferred from Water Temperatures'
air
and ice
data.import requests
import pandas as pd
import math
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# specify the data from the ranges in the file
air_header1 = data[74].split() # not used
air_header2 = [v.strip() for v in data[75].split(',')]
# combine the 2 parts of the header into a single header
air_header = air_header2[:2] + [f'{air_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(air_header2[2:])]
air_data = [v.split() for v in data[77:2125]]
h2o_header1 = data[2129].split() # not used
h2o_header2 = [v.strip() for v in data[2130].split(',')]
# combine the 2 parts of the header into a single header
h2o_header = h2o_header2[:2] + [f'{h2o_header1[math.floor(i/2)]}_{v}' for i, v in enumerate(h2o_header2[2:])]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=air_header)
h2o = pd.DataFrame(h2o_data, columns=h2o_header)
import pandas as pd
import requests
# read the file with requests
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
response = requests.get(url)
data = response.text
# convert data into a list
data = [d.strip().replace('% ', '') for d in data.split('\n')]
# manually created header
headers = ['Year', 'Month', 'Monthly_Anomaly', 'Monthly_Unc.',
'Annual_Anomaly', 'Annual_Unc.',
'Five-year_Anomaly', 'Five-year_Unc.',
'Ten-year_Anomaly', 'Ten-year_Unc.',
'Twenty-year_Anomaly', 'Twenty-year_Unc.']
# separate the air and h2o data
air_data = [v.split() for v in data[77:2125]]
h2o_data = [v.split() for v in data[2132:4180]]
# create the dataframes
air = pd.DataFrame(air_data, columns=headers)
h2o = pd.DataFrame(h2o_data, columns=headers)
air
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.777 0.412 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.239 0.458 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.426 0.447 NaN NaN NaN NaN NaN NaN NaN NaN
h2o
Year Month Monthly_Anomaly Monthly_Unc. Annual_Anomaly Annual_Unc. Five-year_Anomaly Five-year_Unc. Ten-year_Anomaly Ten-year_Unc. Twenty-year_Anomaly Twenty-year_Unc.
0 1850 1 -0.724 0.370 NaN NaN NaN NaN NaN NaN NaN NaN
1 1850 2 -0.221 0.430 NaN NaN NaN NaN NaN NaN NaN NaN
2 1850 3 -0.443 0.419 NaN NaN NaN NaN NaN NaN NaN NaN
Upvotes: 2
Reputation: 4060
This one is quite tricky. Please try out the snippet code below:
import pandas as pd
url = 'http://berkeleyearth.lbl.gov/auto/Global/Land_and_Ocean_complete.txt'
df = pd.read_csv(url,
sep='\s+',
comment='%',
usecols=(0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11),
names=('Year', 'Month', 'M.Anomaly', 'M.Unc.', 'A.Anomaly',
'A.Unc.','5y.Anomaly', '5y.Unc.' ,'10y.Anomaly', '10y.Unc.',
'20y.Anomaly', '20y.Unc.'))
Upvotes: 2