Reputation: 31
Looking to grab data from tables on a specific esports site, and I appear to be struggling.
I was told the pandas library could help me achieve this with just a few lines:
import pandas as pd
tables = pd.read_html('https://www.hltv.org/stats/teams/matches/5752/Cloud9')
print(tables[0])
I tried editing it to make it work for me, but with no success:
import pandas as pd
from urllib.request import Request, urlopen
req = Request('https://www.hltv.org/stats/teams/matches/5752/Cloud9', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
tables = pd.read_html ('https://www.hltv.org/stats/teams/matches/5752/Cloud9)
print(tables[0])
I was led to believe this might be the solution I was looking for, or something similar to it, but when I try to resolve the issue in this fashion, I have no success.
Traceback (most recent call last):
File "C:\Users\antho\OneDrive\Documents\Python\tables clloud9.py", line 6, in <module>
webpage = urlopen(req).read()
File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 569, in error
return self._call_chain(*args)
File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\antho\AppData\Local\Programs\Python\Python37-32\lib\urllib\request.py", line 649, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
All I want at the moment is for the table at that link to be pulled.
Upvotes: 3
Views: 4214
Reputation: 1844
This is probably because of a server security feature that blocks known spider/bot user agents. urllib can be easily spotted and blocked by anti-scraping tools, especially with the header you are using. Try passing one of the user agents found here into the header and see if one of them works.
However, in your specific case the robots.txt file disallows crawlers on the stats page, so they are probably blocking all known crawlers, including urllib.
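You can check this kind of rule yourself with the standard library's robots.txt parser. A minimal offline sketch, using a hypothetical robots.txt that disallows the stats path (the real file at https://www.hltv.org/robots.txt may differ):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the live file may contain different rules.
robots_txt = """\
User-agent: *
Disallow: /stats/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A generic crawler is blocked from the stats pages but not the front page.
print(rp.can_fetch("*", "https://www.hltv.org/stats/teams/matches/5752/Cloud9"))  # False
print(rp.can_fetch("*", "https://www.hltv.org/"))  # True
```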
Follow the example here to try scraping with Selenium. Selenium looks more like a real user than a scraper, so it is often (at least for me) used as a workaround when you get an HTTP Error 403: Forbidden.
Upvotes: 2
Reputation: 46
import pandas as pd
from urllib.request import Request, urlopen
req = Request('https://www.hltv.org/stats/teams/matches/5752/Cloud9', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
tables = pd.read_html(webpage)  # here was the err: the URL string was missing its closing quote, and read_html must be given the page you already fetched, since handing it the URL repeats the 403
print(tables[0])
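This works because read_html parses raw HTML as well as URLs, so pandas never has to make its own (header-less) request. A minimal offline sketch with a toy table standing in for the downloaded page:

```python
import io

import pandas as pd

# A toy HTML table standing in for the page bytes fetched via urlopen.
html = """
<table>
  <tr><th>Date</th><th>Opponent</th><th>Result</th></tr>
  <tr><td>2019-04-01</td><td>Team A</td><td>16-9</td></tr>
  <tr><td>2019-04-02</td><td>Team B</td><td>12-16</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
tables = pd.read_html(io.StringIO(html))
print(tables[0])
```

Wrapping the string in io.StringIO keeps newer pandas versions from trying to interpret it as a path or URL.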
Upvotes: 3