StormsEdge
StormsEdge

Reputation: 885

Receiving HTTP Error 403: Forbidden CSV download

I am trying to access a csv programmatically at the following url: http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate=20180627&reportType=F&productId=425

I have tried this in two ways, one by simply passing the URL to data_sheet = pd.read_csv(sheet_url). I receive an HTTP Error 403: Forbidden exception when trying with this method.

def get_sheet(self):
        # Accesses CME direct URL (at the moment...will add functionality for ICE later)
        # Gets sheet and puts it in dataframe
        #Returns dataframe sheet

        sheet_url = "http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate="+str(self.date_of_report)+"&reportType="\
        + str(self.report_type)+"&productId=" + str(self.product)

        header = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest"
        }

        data_sheet = pd.read_csv(sheet_url)

        return data_sheet

I have also tried pretending to be a browser thinking that perhaps the site was not allowing a direct call to the csv, but then I receive an Invalid file path or buffer object type: <class 'requests.models.Response'> exception

def get_sheet(self):
        # Accesses CME direct URL (at the moment...will add functionality for ICE later)
        # Gets sheet and puts it in dataframe
        #Returns dataframe sheet

        sheet_url = "http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate="+str(self.date_of_report)+"&reportType="\
        + str(self.report_type)+"&productId=" + str(self.product)

        header = {
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest"
        }

        req = requests.get(url = sheet_url, headers = header)

        data_sheet = pd.read_csv(req)

        return data_sheet

My end goal is to simply retrieve the CSV at that URL and and return a data frame. What am I missing?

UPDATE: I have done some more fiddling and just printed req and i get an output of Response [200] which from what I can see in the HTTP docs means the server is receiving my information. Does anyone know if the issue is that I am directly accessing a URL where a csv is stored that normally, if you click the button the url is associated with, it automatically downloads the file. In checking my downloads folder I do not see any downloads for the file. So while the server might be receiving a valid request I may not be handling the url behavior correctly. Any ideas?

Upvotes: 4

Views: 7208

Answers (3)

Oliver Schupp
Oliver Schupp

Reputation: 423

Since 1.2 of pandas, it is possible to tune the used reader by adding options as dictionary keys to the storage_options parameter of read_excel. So by invoking it with

import pandas as pd


url = ''
storage_options = {'User-Agent': 'Mozilla/5.0'}
df = pd.read_excel(url, storage_options=storage_options)

the library will include the User-Agent header to the request so you don't have to set it up externally and before to the invocation of read_excel.

Upvotes: 1

Dhivs
Dhivs

Reputation: 21

You can simply use requests with headers to avoid the forbidden 403 error, and then do a skiprows while reading the excel file, to make sure the image inside the file doesn't create problems when importing to python.

import pandas as pd    
import requests

hdr = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
            "X-Requested-With": "XMLHttpRequest"} #change the version of the browser accordingly

resp = requests.get('http://www.cmegroup.com/CmeWS/exp/voiProductDetailsViewExport.ctl?media=xls&tradeDate=20180627&reportType=F&productId=425', headers = hdr)

pd.read_excel(resp.content, skiprows = range(0,5))

Upvotes: 1

Simas Joneliunas
Simas Joneliunas

Reputation: 3138

There are two things wrong with your code:

  1. You are passing a response object to pandas,

    data_sheet = pd.read_csv(sheet_url) when your actual csv data is in sheet_url.content

  2. pandas cannot read csv from string, pd.read_csv only works with file objects. Thus to read your downloaded content, you need to create a file either using string writer to create a physical file or use io.StringIO(response.content.decode('utf-8'))

An example of using io module is:

import requests
import io
import pandas as pd

response = requests.get('http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv')

file_object = io.StringIO(response.content.decode('utf-8'))
pd.read_csv(file_object)

Upvotes: 7

Related Questions