Reputation: 5529
I wrote a script to find spelling mistakes in SO questions' titles. I used it for about a month, and it was working fine.
But now, when I try to run it, I get this:
Traceback (most recent call last):
  File "copyeditor.py", line 32, in <module>
    find_bad_qn(i)
  File "copyeditor.py", line 15, in find_bad_qn
    html = urlopen(url)
  File "/usr/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.4/urllib/request.py", line 469, in open
    response = meth(req, response)
  File "/usr/lib/python3.4/urllib/request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python3.4/urllib/request.py", line 507, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.4/urllib/request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
This is my code:
import json
from urllib.request import urlopen
from bs4 import BeautifulSoup
from enchant import DictWithPWL
from enchant.checker import SpellChecker
my_dict = DictWithPWL("en_US", pwl="terms.dict")
chkr = SpellChecker(lang=my_dict)
result = []
def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html5lib")
    que = bsObj.find_all("div", class_="question-summary")
    for div in que:
        link = div.a.get('href')
        name = div.a.text
        chkr.set_text(name.lower())
        list1 = []
        for err in chkr:
            list1.append(chkr.word)
        if (len(list1) > 1):
            str1 = ' '.join(list1)
            result.append({'link': link, 'name': name, 'words': str1})
print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)
for qn in result:
    qn['link'] = "https://stackoverflow.com" + qn['link']
for qn in result:
    print(qn['link'], " Error Words:", qn['words'])
    url = qn['link']
UPDATE
This is the URL causing the problem, even though it exists:
https://stackoverflow.com/questions?page=298314&sort=active
I tried changing the range to some lower values, and it works fine now.
Why did this happen with the above URL?
Upvotes: 16
Views: 119676
Reputation: 11
Check the link by opening it in a browser. If the page is not there, then there is no problem with your code: the link or site itself does not exist, which is what "Not Found" means.
Upvotes: 0
Reputation: 298
The default 'User-Agent' doesn't seem to have as much access as Mozilla.
Try importing Request and passing headers={'User-Agent': 'Mozilla/5.0'} to it together with your URL, then open the resulting request instead of the bare URL.
For example:
from urllib.request import Request, urlopen
# f-strings need Python 3.6+; on older versions, build the URL with + or str.format()
url = f"https://stackoverflow.com/questions?page={a}&sort=active"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req)
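To see what "the default User-Agent" refers to, the following minimal sketch prints the value urllib sends when no header is supplied and confirms that a Request built as above carries the replacement. It makes no network call. The helper name make_request is hypothetical, introduced only for illustration, and whether a given site treats the two agents differently is an assumption that would need testing against that site.

```python
from urllib.request import Request, build_opener

# urllib's default opener identifies itself as "Python-urllib/<major>.<minor>".
default_agent = dict(build_opener().addheaders)["User-agent"]
print("urllib default User-Agent:", default_agent)


def make_request(page, user_agent="Mozilla/5.0"):
    """Build (but do not send) a request for one listing page.

    make_request is a hypothetical helper, not part of the original script.
    """
    url = "https://stackoverflow.com/questions?page={}&sort=active".format(page)
    return Request(url, headers={"User-Agent": user_agent})


req = make_request(1)
# Request normalises header names with str.capitalize(), hence "User-agent".
print("Overridden User-Agent:", req.get_header("User-agent"))
```

Passing req to urlopen, as in the answer above, then sends the overridden header instead of the default one.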
Upvotes: 7
Reputation: 309
This happens because the URL doesn't exist, so please recheck it. I had the same issue, and on rechecking I found that my URL was wrong, so I corrected it.
Upvotes: 2
Reputation: 409
I had exactly the same problem. The URL that I wanted to fetch with urllib exists and is accessible in a normal browser, but urllib reports a 404.
The solution for me was to not use urllib:
import requests
requests.get(url)
This works for me.
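Before switching libraries, it may help to look at what the server actually sent back with the 404, since the body sometimes hints at why a script is treated differently from a browser. The sketch below is one way to do that with urllib alone. The names summarize_http_error and try_open are hypothetical helpers, and the 200-byte preview length is an arbitrary choice.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def summarize_http_error(err, preview_bytes=200):
    """Collect the status code, reason, and start of the body from an HTTPError."""
    return {
        "code": err.code,
        "reason": err.reason,
        # HTTPError is file-like, so the error page can be read like a response.
        "body_preview": err.read(preview_bytes).decode("utf-8", errors="replace"),
    }


def try_open(url):
    """Return ("ok", body bytes) on success or ("error", summary) on an HTTP error."""
    try:
        with urlopen(url) as response:
            return "ok", response.read()
    except HTTPError as err:
        return "error", summarize_http_error(err)
```

Calling try_open(url) on the failing URL and printing the summary shows whether the 404 comes with an ordinary "page not found" document or with something else.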
Upvotes: 10
Reputation: 1750
Apparently the default number of questions displayed per page is 50, so the range you defined in the loop goes past the last available page. Adapt the range so that it stays within the total number of pages at 50 questions each.
The code below catches the 404 error, which is what caused your traceback, and ignores it in case the loop does go out of range.
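from urllib.error import HTTPError
from urllib.request import urlopen
def find_bad_qn(a):
    url = "https://stackoverflow.com/questions?page=" + str(a) + "&sort=active"
    try:
        urlopen(url)
    except HTTPError as err:
        # Ignore pages that do not exist; re-raise any other HTTP error.
        if err.code != 404:
            raise
print("Please Wait.. it will take some time")
for i in range(298314,298346):
    find_bad_qn(i)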
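The two ideas in this answer, keeping the range within the number of pages and tolerating a 404, can be sketched together as follows. This is only an illustration: last_page, fetch_page and collect_pages are hypothetical names, the total question count has to come from somewhere else, and stopping at the first 404 assumes that every later page is missing as well.

```python
import math
from urllib.error import HTTPError
from urllib.request import urlopen

BASE_URL = "https://stackoverflow.com/questions?page={page}&sort=active"


def last_page(total_questions, per_page=50):
    """Number of listing pages needed for total_questions at per_page each."""
    return math.ceil(total_questions / per_page)


def fetch_page(page):
    """Return the raw HTML of one listing page (performs a network request)."""
    with urlopen(BASE_URL.format(page=page)) as response:
        return response.read()


def collect_pages(first, last, fetch=fetch_page):
    """Fetch pages first..last-1, stopping at the first page that returns 404."""
    pages = {}
    for page in range(first, last):
        try:
            pages[page] = fetch(page)
        except HTTPError as err:
            if err.code != 404:
                raise  # anything other than "Not Found" is a different problem
            print("Page", page, "does not exist; stopping.")
            break
    return pages
```

The fetch parameter exists so that the loop can be exercised with a stand-in function instead of real requests, which also makes it easy to swap in a fetcher that sets a custom User-Agent.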
Upvotes: 11