Reputation: 21
So I have some code that I use to scrape through my mailbox looking for certain URLs. Once this is completed, it creates a file called links.txt
I want to run a script against that file to get an output of all the URLs in that list that are currently live. The script I have only allows me to check one URL at a time:
import urllib2

for url in ["http://www.google.com"]:
    try:
        connection = urllib2.urlopen(url)
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError as e:
        print e.getcode()
Upvotes: 0
Views: 541
Reputation: 184345
It is trivial to make this change, given that you're already iterating over a list of URLs:
import urllib2

for url in open("links.txt"):                       # change 1: iterate over the lines of the file
    try:
        connection = urllib2.urlopen(url.rstrip())  # change 2: strip the trailing newline
        print connection.getcode()
        connection.close()
    except urllib2.HTTPError as e:
        print e.getcode()
Iterating over a file returns the lines of the file (complete with line endings). We use rstrip()
on the URL to strip off the line endings.
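For example:

>>> line = "http://www.google.com\n"   # a line as read from the file
>>> line.rstrip()
'http://www.google.com'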
There are other improvements you can make. For example, some will suggest you use with
to make sure your file is closed. This is good practice but probably not necessary in this script.
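If you do want that, here is a minimal sketch of the same loop wrapped in with (assuming the file is the links.txt from the question):

import urllib2

with open("links.txt") as f:   # the file is closed automatically when this block exits
    for url in f:
        try:
            connection = urllib2.urlopen(url.rstrip())
            print connection.getcode()
            connection.close()
        except urllib2.HTTPError as e:
            print e.getcode()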
Upvotes: 1
Reputation: 3235
Use requests:
import requests

good_links = []
with open("links.txt") as f:
    for link in f:
        try:
            r = requests.get(link.strip())
        except Exception:
            continue               # skip links that fail to connect
        good_links.append(r.url)   # resolves redirects
You can also consider extracting the call to requests.get into a helper function:
def make_request(method, url, **kwargs):
    for i in range(10):            # retry up to 10 times
        try:
            r = requests.request(method, url, **kwargs)
            return r
        except requests.ConnectionError as e:
            print e
        except requests.HTTPError as e:
            print e
        except requests.RequestException as e:
            print e
    raise Exception("requests did not succeed")
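You could then filter the file with the helper (a sketch; make_request raises after ten failed attempts, so a dead link is caught and skipped):

good_links = []
with open("links.txt") as f:
    for link in f:
        try:
            r = make_request("GET", link.strip())
        except Exception:
            continue               # all retries failed, skip this link
        good_links.append(r.url)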
Upvotes: 4