Reputation: 164
I am trying to download a CSV file with Python 3.x. The path of the file is: https://www.nseindia.com/content/fo/fo_mktlots.csv
I have found three ways to do it, but only one of them works. I wanted to know why, or what I am doing wrong in the others.
Method 1: (Unsuccessful)
import pandas as pd
mytable = pd.read_table("https://www.nseindia.com/content/fo/fo_mktlots.csv",sep=",")
print(mytable)
But I am getting the following error:
- HTTPError: HTTP Error 403: Forbidden
Method 2: (Unsuccessful)
from urllib.request import Request, urlopen
url='https://www.nseindia.com/content/fo/fo_mktlots.csv'
url_request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(url_request).read()
Got the same error as before:
- HTTPError: HTTP Error 403: Forbidden
Method 3: (Successful)
import requests
import pandas as pd
from io import StringIO

url = 'https://www.nseindia.com/content/fo/fo_mktlots.csv'
r = requests.get(url)
df = pd.read_csv(StringIO(r.text))
I am also able to open the file with Excel VBA as below:
Workbooks.Open Filename:="https://www.nseindia.com/content/fo/fo_mktlots.csv"
Also, is there any other method to do the same?
Upvotes: 6
Views: 14627
Reputation: 18106
The website tries to prevent content scraping.
The issue is not so much what you are doing wrong as how the web server is configured and how it behaves in different situations.
To get past the scraping protection, send well-defined HTTP request headers; the most reliable way is to send the complete set of headers a real web browser would send.
Here it works with a minimal set:
>>> myHeaders = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36', 'Referer': 'https://www.nseindia.com', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}
>>> url_request = Request(url, headers=myHeaders)
>>> html = urlopen(url_request).read()
>>> len(html)
42864
>>>
You can pass the urllib response object directly to pandas:
>>> import pandas as pd
...
>>> url_request = Request(url, headers=myHeaders)
>>> data = urlopen(url_request)
>>> my_table = pd.read_table(data, sep=",")
>>> len(my_table)
187
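For comparison, here is a minimal sketch of the same idea using requests instead of urllib, assuming the same url and myHeaders defined above: requests sends the browser-like headers and pandas parses the response text.
# Sketch only: reuses the `url` and `myHeaders` from the transcript above.
import requests
import pandas as pd
from io import StringIO

r = requests.get(url, headers=myHeaders, timeout=30)
r.raise_for_status()                    # raise on 403/404 instead of silently parsing an error page
df = pd.read_csv(StringIO(r.text))      # parse the downloaded CSV text into a DataFrame
print(len(df))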
Upvotes: 5
Reputation: 423
Since pandas 1.2, it is possible to tune the underlying reader by adding options as dictionary keys to the storage_options parameter of read_table. So by invoking it with
import pandas as pd
url = 'https://www.nseindia.com/content/fo/fo_mktlots.csv'
storage_options = {'User-Agent': 'Mozilla/5.0'}
df = pd.read_table(url, storage_options=storage_options)
the library will include the User-Agent header in the request, so you don't have to set it up externally before the invocation of read_table.
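A follow-up sketch, assuming the same NSE URL from the question: if pandas forwards storage_options entries as request headers for plain HTTP(S) URLs (as the User-Agent example above suggests), the fuller browser-style headers from the other answer can be passed the same way, and read_csv accepts the parameter too.
import pandas as pd

# URL from the question; headers modeled on the other answer. This is a sketch,
# not a guarantee that the server accepts this particular header set.
url = 'https://www.nseindia.com/content/fo/fo_mktlots.csv'
my_headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Referer': 'https://www.nseindia.com',
}
df = pd.read_csv(url, storage_options=my_headers)
print(df.head())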
Upvotes: 0