Trexion Kameha

Reputation: 3580

Python: Loading zip file stored in CSV from Web

I installed Python 3.5 (against some of your suggestions) and cannot figure out why the new code won't load the zip file from a URL:

import pandas as pd
import numpy as np
from io import StringIO
from zipfile import ZipFile
from urllib.request import urlopen
url = urlopen("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip")

#Download Zipfile and create pandas DataFrame
zipfile = ZipFile(StringIO(url.read()))
FFdata = pd.read_csv(zipfile.open('F-F_Research_Data_Factors.CSV'), 
                     header = 0, names = ['Date','MKT-RF','SMB','HML','RF'], 
                     skiprows=3)

I believe it's failing on the urlopen function, but it also doesn't work when I substitute the URL as a text string.

Does anyone know what's happening? Thank you!

Upvotes: 1

Views: 2248

Answers (1)

tdelaney

Reputation: 77337

Running your program I get the error

Traceback (most recent call last):
  File "c.py", line 9, in <module>
    zipfile = ZipFile(StringIO(url.read()))
TypeError: initial_value must be str or None, not bytes

A quick test confirms the problem: in Python 3, url.read() returns bytes, and you are passing that byte string to StringIO, which only accepts str.

td@mintyfresh ~/tmp $ python3
Python 3.4.3 (default, Nov 17 2016, 01:08:31) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> io.StringIO(b'aaa')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: initial_value must be str or None, not bytes

The solution is simple: use an io.BytesIO object instead. This is a common error because StringIO would have worked in Python 2, where str is a byte string, and lots of examples are still 2.x based.

import pandas as pd
import numpy as np
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
url = urlopen("http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_Factors_CSV.zip")

#Download Zipfile and create pandas DataFrame
zipfile = ZipFile(BytesIO(url.read()))
FFdata = pd.read_csv(zipfile.open('F-F_Research_Data_Factors.CSV'), 
                     header = 0, names = ['Date','MKT-RF','SMB','HML','RF'], 
                     skiprows=3)
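
If you want to check the bytes-versus-str behaviour without hitting the network, here is a minimal sketch that uses only the standard library. The archive, the member name sample.csv, and its contents are made-up placeholders, not the Ken French data, and the helper names are hypothetical.

```python
import io
import zipfile


def make_zip_bytes(member_name, text):
    """Build a small zip archive in memory and return its raw bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, text)
    return buffer.getvalue()


def read_member_text(zip_bytes, member_name):
    """Open the archive from bytes via BytesIO and return one member as text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as member:
            return member.read().decode("utf-8")


# Stand-in for url.read(): raw bytes of a zip file (placeholder content).
payload = make_zip_bytes("sample.csv", "a,b\n1,2\n")

# StringIO rejects bytes in Python 3, which is the error in the question.
try:
    io.StringIO(payload)
    stringio_error = None
except TypeError as exc:
    stringio_error = exc

# BytesIO accepts the same payload, so ZipFile can read it.
csv_text = read_member_text(payload, "sample.csv")
```

The file-like object returned by archive.open() is the same kind of object that the answer passes to pd.read_csv.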

Upvotes: 3
