Anthony Oakley

Reputation: 31

Why am I getting an HTTP 403 error with pandas?

I'm looking to grab data from tables on a specific esports site, and I appear to be struggling.

I was told the pandas library could help me achieve this with just a few lines.

import pandas as pd


tables = pd.read_html('https://www.hltv.org/stats/teams/matches/5752/Cloud9')

print(tables[0])

I tried to edit it to make it work for me, but had no success.

import pandas as pd

from urllib.request import Request, urlopen

req = Request('https://www.hltv.org/stats/teams/matches/5752/Cloud9', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

tables = pd.read_html ('https://www.hltv.org/stats/teams/matches/5752/Cloud9)

print(tables[0])

I was led to believe this, or something similar to it, might be the solution I was looking for, but when I try to resolve the issue this way, I have no success.

Traceback (most recent call last):
  File "C:\Users\antho\OneDrive\Documents\Python\tables clloud9.py", line 6, in <module>
    webpage = urlopen(req).read()
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 641, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 503, in _call_chain
    result = func(*args)
  File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

All I want at the moment is for the table on the link to be pulled.

Upvotes: 3

Views: 4214

Answers (2)

Edeki Okoh

Reputation: 1844

This is probably because of a server security feature that blocks known spider/bot user agents. urllib is easily spotted and blocked by anti-scraping tools, especially with the header you are using. Try passing one of the user agents found here in the header and see if one of them works.
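The idea above can be sketched as follows; the User-Agent string is an example value (any common browser UA would do), and the header is attached to the request before any network call is made:

```python
from urllib.request import Request

# Example browser-like User-Agent string (an assumption; substitute any
# current browser UA from a user-agent list).
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36")

# Build the request with the custom header; nothing is fetched yet,
# so this can be inspected offline.
req = Request(
    "https://www.hltv.org/stats/teams/matches/5752/Cloud9",
    headers={"User-Agent": ua},
)

# urllib normalizes header names, so the key is looked up as "User-agent".
print(req.get_header("User-agent"))
```

Whether a given UA gets past the block is up to the server; the point is only that `Request` lets you replace the default `Python-urllib/3.x` identifier that anti-scraping rules match on.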

However, in your specific case, the site's robots.txt file disallows crawlers on the stats pages, so they are probably blocking all known crawlers, including urllib.
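You can check such a rule programmatically with the standard library's `urllib.robotparser`. The robots.txt content below is a hypothetical stand-in mirroring a `Disallow: /stats/` rule (the real file lives at https://www.hltv.org/robots.txt), so the sketch runs offline:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content assuming a blanket Disallow on /stats/;
# in practice you would fetch and parse the site's real robots.txt.
robots_txt = """\
User-agent: *
Disallow: /stats/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The stats page falls under the Disallow rule, other paths do not.
print(rp.can_fetch("Mozilla/5.0", "https://www.hltv.org/stats/teams/matches/5752/Cloud9"))
print(rp.can_fetch("Mozilla/5.0", "https://www.hltv.org/news"))
```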

Follow the example here to try scraping with Selenium. Selenium looks more like a real user than a scraper, so it is often (at least for me) a workaround when you get an HTTP Error 403: Forbidden.

Upvotes: 2

Yassine Baghdadi

Reputation: 46

import pandas as pd

from urllib.request import Request, urlopen

req = Request('https://www.hltv.org/stats/teams/matches/5752/Cloud9', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

tables = pd.read_html(webpage)  # here was the err: parse the HTML fetched with the User-Agent header instead of re-requesting the URL (which 403s again)

print(tables[0])
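The key change is that `pd.read_html` accepts already-fetched markup, not just a URL. A minimal offline sketch, using a made-up table (the column names and rows are illustrative, not real HLTV data):

```python
import io
import pandas as pd

# Stand-in for the bytes returned by urlopen(req).read(); this is
# hypothetical markup, not the real HLTV page.
html = """
<table>
  <tr><th>Map</th><th>Result</th></tr>
  <tr><td>Mirage</td><td>16-9</td></tr>
  <tr><td>Inferno</td><td>14-16</td></tr>
</table>
"""

# read_html accepts a file-like object, so the request that carried the
# custom User-Agent header is the only network call in the real script.
tables = pd.read_html(io.StringIO(html))
print(tables[0])
```

Because the `<th>` cells form the header row, `tables[0]` comes back as a DataFrame with `Map` and `Result` columns.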

Upvotes: 3
