MikeRand

Reputation: 4828

Getting Numpy overflow despite declaring dtype=int64

I'm downloading stock prices from Yahoo for the S&P 500, whose volume is too big for a 32-bit integer.

def yahoo_prices(ticker, start_date=None, end_date=None, data='d'):

    csv = yahoo_historical_data(ticker, start_date, end_date, data)

    d = [('date',      np.datetime64),
         ('open',      np.float64),
         ('high',      np.float64),
         ('low',       np.float64),
         ('close',     np.float64),
         ('volume',    np.int64),
         ('adj_close', np.float64)]

    return np.recfromcsv(csv, dtype=d)

Here's the error:

>>> sp500 = yahoo_prices('^GSPC')
Traceback (most recent call last):
  File "<stdin>", line 108, in <module>
  File "<stdin>", line 74, in yahoo_prices
  File "/usr/local/lib/python2.6/dist-packages/numpy/lib/npyio.py", line 1812, in recfromcsv
    output = genfromtxt(fname, **kwargs)
  File "/usr/local/lib/python2.6/dist-packages/numpy/lib/npyio.py", line 1646, in genfromtxt
    output = np.array(data, dtype=ddtype)
OverflowError: long int too large to convert to int

Why would I still be getting this error if I declared the dtype to use int64? Is this an indication that the io function isn't really using my dtype sequence d?

===Edit ... example csv added===

Date,Open,High,Low,Close,Volume,Adj Close
2012-06-15,1329.19,1343.32,1329.19,1342.84,4401570000,1342.84
2012-06-14,1314.88,1333.68,1314.14,1329.10,3687720000,1329.10
2012-06-13,1324.02,1327.28,1310.51,1314.88,3506510000,1314.88
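As a quick sanity check on the sample rows above, the following sketch compares each volume against the signed 32-bit and 64-bit integer limits. The names `INT32_MAX`, `INT64_MAX`, and `volumes` are introduced here for illustration only; `np.iinfo(np.int32).max` and `np.iinfo(np.int64).max` report the same limits.

```python
# Signed integer limits, written out in plain Python.
INT32_MAX = 2**31 - 1   # 2147483647
INT64_MAX = 2**63 - 1   # 9223372036854775807

# Volume column from the three sample rows.
volumes = [4401570000, 3687720000, 3506510000]

fits_int32 = [v <= INT32_MAX for v in volumes]
fits_int64 = [v <= INT64_MAX for v in volumes]

for v, ok32, ok64 in zip(volumes, fits_int32, fits_int64):
    print(v, "fits int32:", ok32, "fits int64:", ok64)
```

None of the three values fits in 32 bits, while all of them fit comfortably in 64 bits, so an overflow here suggests the int64 field is not actually being applied.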

Upvotes: 3

Views: 1219

Answers (2)

KobeJohn

Reputation: 7545

I'm not sure, but I think you found a bug in numpy. I filed it here.

As I said there, if you open npyio.py and edit this line within recfromcsv:

kwargs.update(dtype=kwargs.get('update', None),

to this:

kwargs.update(dtype=kwargs.get('dtype', None),

With that change it works for me with no problem for the long integer (I didn't check whether the datetimes are correct, which Joe covers in his answer). You may notice that your dates weren't being converted either. Here is the specific code that works. The contents of "test.csv" are copied and pasted from your example CSV data.

import numpy as np
d = [('date',      np.datetime64),
     ('open',      np.float64),
     ('high',      np.float64),
     ('low',       np.float64),
     ('close',     np.float64),
     ('volume',    np.int64),
     ('adj_close', np.float64)]
a = np.recfromcsv("test.csv", dtype=d)
print(a)

[ (datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 1329.19, 1343.32, 1329.19, 1342.84, 4401570000, 1342.84)
 (datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 1314.88, 1333.68, 1314.14, 1329.1, 3687720000, 1329.1)
 (datetime.datetime(1969, 12, 31, 23, 59, 59, 999999), 1324.02, 1327.28, 1310.51, 1314.88, 3506510000, 1314.88)]
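To see why a one-word typo in the lookup key is enough to drop the dtype, here is a minimal pure-Python sketch of the pattern. The functions `buggy_update` and `fixed_update` are hypothetical stand-ins written for this illustration; they are not numpy code and only mimic the single `kwargs.update(...)` line quoted above.

```python
def buggy_update(**kwargs):
    # Looks up the key 'update', which the caller never passes,
    # so the caller's dtype is overwritten with None.
    kwargs.update(dtype=kwargs.get('update', None))
    return kwargs


def fixed_update(**kwargs):
    # Looks up 'dtype', so the caller's value is preserved.
    kwargs.update(dtype=kwargs.get('dtype', None))
    return kwargs


d = [('volume', 'int64')]

buggy = buggy_update(dtype=d)
fixed = fixed_update(dtype=d)

print(buggy['dtype'])
print(fixed['dtype'])
```

With the wrong key the dtype silently becomes `None`, which is consistent with the types being guessed from the data instead of taken from `d`.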

Update: If you don't want to modify numpy, you can use the relevant code from recfromcsv directly.

I've also "fixed" the datetime issue by using a native Python object in the datetime field. I don't know if that will work for you.

import datetime
import numpy as np

d = [('date',      datetime.datetime),
     ('open',      np.float64),
     ('high',      np.float64),
     ('low',       np.float64),
     ('close',     np.float64),
     ('volume',    np.int64),
     ('adj_close', np.float64)]

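# Replaces the direct call, which drops the dtype because of the typo:
# a = np.recfromcsv("test.csv", dtype=d)
kwargs = {"dtype": d}
# Lowercase the field names and read them from the header row (names=True).
case_sensitive = kwargs.get('case_sensitive', "lower") or "lower"
names = kwargs.get('names', True)
kwargs.update(
    delimiter=kwargs.get('delimiter', ",") or ",",
    names=names,
    case_sensitive=case_sensitive)
# Call genfromtxt directly so the dtype is passed through unchanged.
output = np.genfromtxt("test.csv", **kwargs)
# View the structured array as a recarray to get attribute access to fields.
output = output.view(np.recarray)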

print(output)

Upvotes: 3

Joe Kington

Reputation: 284602

You need to convert your date strings to actual dates. The formats in your dtype are being ignored because the first column can't be directly converted to a datetime.

numpy expects you to be fairly explicit and refuses to guess date formats.

(Edit: This used to be the case, but isn't anymore.)

It expects datetime objects. See dateutil.parser if you want to guess date/time formats from strings.

At any rate, you'll want something like the following:

from cStringIO import StringIO
import datetime as dt
import numpy as np

dat = """Date,Open,High,Low,Close,Volume,Adj Close
2012-06-15,1329.19,1343.32,1329.19,1342.84,4401570000,1342.84
2012-06-14,1314.88,1333.68,1314.14,1329.10,3687720000,1329.10
2012-06-13,1324.02,1327.28,1310.51,1314.88,3506510000,1314.88"""
infile = StringIO(dat)

d = [('date',      np.datetime64),
     ('open',      np.float64),
     ('high',      np.float64),
     ('low',       np.float64),
     ('close',     np.float64),
     ('volume',    np.int64),
     ('adj_close', np.float64)]


def parse_date(item):
    return dt.datetime.strptime(item, '%Y-%m-%d')

data = np.recfromcsv(infile, converters={0:parse_date}, dtype=d)
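A date converter like this can be checked on its own before it is handed to numpy. The sketch below uses only the standard library and assumes dates in the `YYYY-MM-DD` form shown in the sample CSV. The variables `parsed` and `with_minutes` are illustrative names. Note that `%m` is the month directive, while `%M` means minutes.

```python
import datetime as dt


def parse_date(item):
    # %Y = four-digit year, %m = month, %d = day of month.
    return dt.datetime.strptime(item, '%Y-%m-%d')


parsed = parse_date('2012-06-15')
print(parsed)

# For comparison: %M reads the "06" as minutes, so the month falls back
# to its default of January instead of June.
with_minutes = dt.datetime.strptime('2012-06-15', '%Y-%M-%d')
print(with_minutes)
```

The second call does not raise an error, which is why a `%M` slip is easy to miss: it quietly yields January dates with a nonzero minute field.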

However, things like this are where pandas shines. Consider using something like the following:

from cStringIO import StringIO
import pandas

dat = """Date,Open,High,Low,Close,Volume,Adj Close
2012-06-15,1329.19,1343.32,1329.19,1342.84,4401570000,1342.84
2012-06-14,1314.88,1333.68,1314.14,1329.10,3687720000,1329.10
2012-06-13,1324.02,1327.28,1310.51,1314.88,3506510000,1314.88"""

infile = StringIO(dat)
data = pandas.read_csv(infile, index_col=0, parse_dates=True)

Upvotes: 1
