Reputation: 554
My brother wanted me to write a web crawler in Python (I'm self-taught in it); I know C++, Java, and a bit of HTML. I'm using version 2.7 and reading the Python library documentation, but I have a few problems:
1. The httplib.HTTPConnection and request concepts are new to me, and I don't understand whether they download an HTML script, like a cookie, or an instance. If you do both of those, do you get the source for a website page? And what are some terms I would need to know to modify the page and return the modified page?
Just for background, I need to download a page and replace any img with ones I have.
It would also be nice if you could tell me your opinion of 2.7 versus 3.1.
Upvotes: 27
Views: 119316
Reputation: 1
Here is some code for this task:
import requests
from requests.exceptions import RequestException
from datetime import datetime
import urllib.parse

def fetch_url(url, retries=3):
    # Send a User-Agent header, since some servers reject requests without one
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    }
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=10, allow_redirects=True)
            if response.status_code == 200:
                # Let requests guess the encoding from the body
                response.encoding = response.apparent_encoding
                return response.text
            else:
                print(f"Error: {response.status_code}")
        except RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
    return None

def get_filename_from_url(url):
    # Build a name such as example.com_20240101120000.html
    parsed_url = urllib.parse.urlparse(url)
    domain = parsed_url.netloc.replace("www.", "")
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    filename = f"{domain}_{timestamp}.html"
    return filename

url = input("Enter the URL: ")
source_code = fetch_url(url)
if source_code:
    filename = get_filename_from_url(url)
    with open(filename, "w", encoding="utf-8") as file:
        file.write(source_code)
    print(f"The source code has been saved to {filename}")
else:
    print("Failed to retrieve the webpage after multiple attempts.")
Upvotes: 0
Reputation: 3758
The approaches above may fail on an HTTPS request to a site behind Cloudflare, which can reject requests that lack a browser-like User-Agent. You can try this to fetch HTML over both HTTP and HTTPS:
import requests
url = 'https://your.link.here'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    print(response.text)
else:
    print(f'Request failed with status code: {response.status_code}')
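If installing requests is not an option, the same User-Agent idea can be sketched with the standard library's urllib.request. This is only an illustration: the URL is the placeholder from the snippet above, the shortened User-Agent string is an assumption, and build_request and fetch are hypothetical helper names. Whether a given site accepts the request still depends on that site.

```python
from urllib import request

# Placeholder values for illustration only (assumptions, not real endpoints)
URL = "https://your.link.here"
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=USER_AGENT):
    """Create a GET request that carries a browser-like User-Agent header."""
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Send the request and return (status, raw bytes).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses and
    urllib.error.URLError for connection problems.
    """
    with request.urlopen(build_request(url), timeout=timeout) as response:
        return response.status, response.read()


# Building the request does not touch the network, so it is safe to inspect
req = build_request(URL)
```

Calling fetch(URL) with a real address performs the actual download; build_request alone only prepares the headers.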
Upvotes: 0
Reputation: 59328
If you are using Python 3.x, you don't need to install any libraries; this is built into the Python standard library. The old urllib2 module has been split across urllib.request and urllib.error:
from urllib import request
response = request.urlopen("https://www.google.com")
# set the correct charset below
page_source = response.read().decode('utf-8')
print(page_source)
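Rather than hard-coding the charset, one option is to read it from the Content-Type header the server sent. The sketch below assumes a UTF-8 fallback when the header names no charset; decode_body and fetch_text are hypothetical helper names, and the demo header is built by hand so it runs without a network connection.

```python
from email.message import Message
from urllib import request


def decode_body(body, headers, fallback="utf-8"):
    """Decode raw bytes using the charset named in the headers, if any."""
    charset = headers.get_content_charset() or fallback
    return body.decode(charset, errors="replace")


def fetch_text(url, timeout=10):
    """Download a page and decode it with the server-declared charset."""
    with request.urlopen(url, timeout=timeout) as response:
        return decode_body(response.read(), response.headers)


# Offline demonstration with a hand-built header object (no network needed)
demo_headers = Message()
demo_headers["Content-Type"] = "text/html; charset=ISO-8859-1"
demo_text = decode_body("café".encode("latin-1"), demo_headers)
```

Pages that declare their encoding only in a meta tag inside the HTML are not covered by this sketch.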
Upvotes: 6
Reputation: 3227
An example with Python 3 and the requests library, as mentioned by @leoluk:
pip install requests
Script req.py:
import requests
url='http://localhost'
# in case you need a session
cd = { 'sessionid': '123..'}
r = requests.get(url, cookies=cd)
# or without a session: r = requests.get(url)
# print the decoded HTML (r.content holds the raw bytes)
print(r.text)
Now execute it and you will get the HTML source of localhost:
python3 req.py
Upvotes: 11
Reputation: 12981
Use Python 2.7; it has more third-party libraries at the moment. (Edit: see below.)
I recommend using the stdlib module urllib2; it will allow you to comfortably get web resources.
Example:
import urllib2
response = urllib2.urlopen("http://google.de")
page_source = response.read()
For parsing the code, have a look at BeautifulSoup.
BTW: what exactly do you want to do:
Just for background, I need to download a page and replace any img with ones I have
Edit: It's 2014 now, most of the important libraries have been ported, and you should definitely use Python 3 if you can. python-requests is a very nice high-level library that is easier to use than urllib2.
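For the img replacement mentioned in the question, here is a rough sketch in Python 3. Note that it swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without extra installs; ImgSrcRewriter and replace_images are hypothetical names, and pointing every image at one replacement file is an assumption about what is wanted. The output is re-serialized, so quoting and spacing inside tags may differ from the input.

```python
from html import escape
from html.parser import HTMLParser


class ImgSrcRewriter(HTMLParser):
    """Re-emit the HTML, giving every <img> tag a new src value."""

    def __init__(self, new_src):
        # convert_charrefs=False keeps text entities such as &amp; as written
        super().__init__(convert_charrefs=False)
        self.new_src = new_src
        self.out = []

    def _tag(self, tag, attrs, closing=""):
        if tag == "img":
            attrs = [(k, self.new_src if k == "src" else v) for k, v in attrs]
        parts = [tag]
        for key, value in attrs:
            # Attribute values arrive unescaped, so escape them again
            parts.append(key if value is None else f'{key}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_images(html, new_src):
    """Return html with the src of every <img> replaced by new_src."""
    parser = ImgSrcRewriter(new_src)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


modified = replace_images('<p>Hi <img src="a.png" alt="x"></p>', "mine.png")
```

Feeding it the page_source from the download step gives back the modified page as a string, which can then be written to a file.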
Upvotes: 48
Reputation: 86774
The first thing you need to do is read the HTTP spec, which explains what you can expect to receive over the wire. The data returned in the response body will be the "rendered" web page, not the source. The source could be a JSP, a servlet, a CGI script, or just about anything else, and you have no access to that. You only get the HTML that the server sent you. In the case of a static HTML page, then yes, you will be seeing the "source". But for anything else you see the generated HTML, not the source.
When you say "modify the page and return the modified page", what do you mean?
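To make the request/response exchange concrete, here is a small self-contained sketch in Python 3, where httplib is named http.client. It starts a throwaway local server so nothing external is contacted; DemoHandler, its HTML body, and the fetch helper are all made up for illustration. The point is that the client only ever receives the bytes the handler writes, never the handler code itself.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server-side code; the client never sees this class."""

    def do_GET(self):
        body = b"<html><body><img src='logo.png'></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(host, port, path="/"):
    """Open a connection, send one GET request, return (status, html)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("GET", path)       # sends the request line and headers
        response = conn.getresponse()   # reads the status line and headers
        return response.status, response.read().decode("utf-8")
    finally:
        conn.close()


# Port 0 asks the OS for any free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
status, html = fetch("127.0.0.1", server.server_address[1])
server.shutdown()
server.server_close()
```

Against a real site, the same fetch call with the site's host name and port 80 returns whatever HTML that server generates for the path.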
Upvotes: 0