Reputation: 885
I am trying to access a csv programmatically at the following url: http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate=20180627&reportType=F&productId=425
I have tried this in two ways, one by simply passing the URL to data_sheet = pd.read_csv(sheet_url)
. I receive an HTTP Error 403: Forbidden
exception when trying with this method.
def get_sheet(self):
# Accesses CME direct URL (at the moment...will add functionality for ICE later)
# Gets sheet and puts it in dataframe
#Returns dataframe sheet
sheet_url = "http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate="+str(self.date_of_report)+"&reportType="\
+ str(self.report_type)+"&productId=" + str(self.product)
header = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
data_sheet = pd.read_csv(sheet_url)
return data_sheet
I have also tried pretending to be a browser thinking that perhaps the site was not allowing a direct call to the csv, but then I receive an Invalid file path or buffer object type: <class 'requests.models.Response'>
exception
def get_sheet(self):
# Accesses CME direct URL (at the moment...will add functionality for ICE later)
# Gets sheet and puts it in dataframe
#Returns dataframe sheet
sheet_url = "http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate="+str(self.date_of_report)+"&reportType="\
+ str(self.report_type)+"&productId=" + str(self.product)
header = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
req = requests.get(url = sheet_url, headers = header)
data_sheet = pd.read_csv(req)
return data_sheet
My end goal is to simply retrieve the CSV at that URL and and return a data frame. What am I missing?
UPDATE: I have done some more fiddling and just printed req
and i get an output of Response [200]
which from what I can see in the HTTP docs means the server is receiving my information. Does anyone know if the issue is that I am directly accessing a URL where a csv is stored that normally, if you click the button the url is associated with, it automatically downloads the file. In checking my downloads folder I do not see any downloads for the file. So while the server might be receiving a valid request I may not be handling the url behavior correctly. Any ideas?
Upvotes: 4
Views: 7208
Reputation: 423
Since 1.2
of pandas
, it is possible to tune the used reader by adding options as dictionary keys to the storage_options
parameter of read_excel
. So by invoking it with
import pandas as pd
url = ''
storage_options = {'User-Agent': 'Mozilla/5.0'}
df = pd.read_excel(url, storage_options=storage_options)
the library will include the User-Agent
header to the request so you don't have to set it up externally and before to the invocation of read_excel
.
Upvotes: 1
Reputation: 21
You can simply use requests with headers to avoid the forbidden 403 error, and then do a skiprows while reading the excel file, to make sure the image inside the file doesn't create problems when importing to python.
import pandas as pd
import requests
hdr = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"} #change the version of the browser accordingly
resp = requests.get('http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate=20180627&reportType=F&productId=425', headers = hdr)
pd.read_excel(resp.content, skiprows = range(0,5))
Upvotes: 1
Reputation: 3138
There are two things wrong with your code:
You are passing a response object to pandas,
data_sheet = pd.read_csv(sheet_url)
when your actual csv data is in sheet_url.content
pandas cannot read csv
from string
, pd.read_csv
only works with file objects. Thus to read your downloaded content, you need to create a file either using string writer to create a physical file or use io.StringIO(response.content.decode('utf-8'))
An example of using io module is:
import requests
import io
import pandas as pd
response = requests.get('http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv')
file_object = io.StringIO(response.content.decode('utf-8'))
pd.read_csv(file_object)
Upvotes: 7