Reputation: 4035
I tried every 'User-Agent'
in here, still I get urllib.error.HTTPError: HTTP Error 400: Bad Request
. I also tried this, but I get urllib.error.URLError: File Not Found
. I have no idea what to do, my current codes are;
from bs4 import BeautifulSoup
import urllib.request,json,ast
with open ("urller.json") as f:
cc = json.load(f) #the file I get links, you can try this link instead of this
#cc = ../games/index.php?g_id=23521&game=0RBITALIS
for x in ast.literal_eval(cc): #cc is a str(list) so I have to convert
if x.startswith("../"):
r = urllib.request.Request("http://www.game-debate.com{}".format(x[2::]),headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
#x[2::] because I removed '../' parts from urlls
rr = urllib.request.urlopen(r).read()
soup = BeautifulSoup(rr)
for y in soup.find_all("ul",attrs={'class':['devDefSysReqList']}):
print (y.text)
Edit: If you try only 1 link probably it won't show any error, since I get the error every time at 6th link.
Upvotes: 1
Views: 382
Reputation: 180411
A quick fix is to replace the space with +
:
url = "http://www.game-debate.com"
r = urllib.request.Request(url + x[2:] ,headers={'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
A better option may be to let urllib quote the params:
from bs4 import BeautifulSoup
import urllib.request,json,ast
from urllib.parse import quote, urljoin
with open ("urller.json") as f:
cc = json.load(f) #the file I get links, you can try this link instead of this
url = "http://www.game-debate.com"
for x in ast.literal_eval(cc): # cc is a str(list) so I have to convert
if x.startswith("../"):
r = urllib.request.Request(urljoin(url, quote(x.lstrip("."))), headers={
'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'})
rr = urllib.request.urlopen(r).read()
soup = BeautifulSoup(rr)
print(rr.decode("utf-8"))
for y in soup.find_all("ul", attrs={'class':['devDefSysReqList']}):
print (y.text)
Spaces in a url are not valid and need to be percent encoded as %20
or replaced with +
.
Upvotes: 1