Reputation: 48
I want to fetch a website's source for a project. When I try to get the response, the program just hangs waiting for it. No matter how long I wait, there is no timeout and no response. Here is my code:
import urllib.request

link = "https://eu.mouser.com/"
linkResponse = urllib.request.urlopen(link)
readedResponse = linkResponse.readlines()
writer = open("html.txt", "w")
for line in readedResponse:
    writer.write(str(line))
    writer.write("\n")
writer.close()
When I try other websites, urlopen returns their response. But when I try to get "eu.mouser.com" and "uk.farnell.com", it never returns their response; urlopen does not even raise a timeout. What is the problem here? Is there another way to get a website's source?
Upvotes: 1
Views: 1424
Reputation: 4666
There are some sites that protect themselves from automated crawlers by implementing mechanisms to detect such bots. These mechanisms can be very diverse and also change over time. If you really want to do everything you can to crawl the page automatically, this usually means implementing steps yourself to circumvent these protective barriers.
One example of this is the header information sent with every request, which can be customized before the request is made. But there are probably more adjustments needed here and there.
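For instance, a minimal sketch of that idea using the urllib from the question (the User-Agent value is just an illustrative assumption, not a guaranteed fix):
import urllib.request

url = "https://eu.mouser.com/"

# Some servers respond differently (or not at all) depending on the
# request headers; a browser-like User-Agent is a common first experiment.
request = urllib.request.Request(
    url,
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
with urllib.request.urlopen(request, timeout=5) as response:
    print(response.status)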
If you're interested in starting to develop such a thing (leaving aside the question of whether it is allowed at all), you can take this as a starting point:
from collections import namedtuple
from contextlib import suppress

import requests
from requests import ReadTimeout

Link = namedtuple("Link", ["url", "filename"])

links = {
    Link("https://eu.mouser.com/", "mouser.com"),
    Link("https://example.com/", "example1.com"),
    Link("https://example.com/", "example2.com"),
}

for link in links:
    with suppress(ReadTimeout):
        response = requests.get(link.url, timeout=3)
        with open(f"html-{link.filename}.txt", "w", encoding="utf-8") as file:
            file.write(response.text)
Here, protected sites that lead to ReadTimeout errors are simply skipped, with the possibility to go further, e.g. by enhancing requests.get(link.url, timeout=3) with a suitable headers parameter. But as already mentioned, this is probably not the only customization that would have to be made, and the legal aspects should also be clarified.
Upvotes: 1
Reputation: 36510
The urllib.request.urlopen docs claim that
The optional timeout parameter specifies a timeout in seconds for blocking operations like the connection attempt (if not specified, the global default timeout setting will be used). This actually only works for HTTP, HTTPS and FTP connections.
without explaining how to find said default. I managed to provoke a timeout by directly providing 5 (seconds) as the timeout:
import urllib.request
url = "https://uk.farnell.com"
urllib.request.urlopen(url, timeout=5)
gives
socket.timeout: The read operation timed out
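For completeness, the global default mentioned in the docs lives in the standard socket module; it is None out of the box, which means blocking operations may wait forever. A small sketch (behaviour as I understand it, worth verifying on your Python version):
import socket
import urllib.request

# None means "no timeout": a blocked read can wait indefinitely.
print(socket.getdefaulttimeout())  # None by default

# A process-wide default makes urlopen time out even when the
# timeout argument is omitted.
socket.setdefaulttimeout(5)
urllib.request.urlopen("https://uk.farnell.com")  # now raises socket.timeout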
Upvotes: 3