Reputation: 8057
I can't work out what is wrong with the script below.
It is supposed to extract image URLs using a regex. I've verified that the regex is correct using http://regex101.com/.
The problem is that it doesn't even grab the first image on the page, even though it should.
The website in the script is an NSFW blog. Please don't follow the link if you are offended by nudity or sexuality.
from urllib2 import urlopen
import re

base = "http://bassrx.tumblr.com"
url = "http://bassrx.tumblr.com/tagged/tt"

def parse_page(url):
    # returns html for parsing
    page = urlopen(url)
    html = page.read()
    return html

def get_links(html):
    # returns list of all image urls on page
    jpgs = re.findall("src.\"(.*?500.jpg)", html, re.IGNORECASE)
    #pngs = re.findall("src.\"(.*?media.tumblr.*?tumblr_.*?png)", html, re.IGNORECASE)
    #links = jpgs + pngs
    return jpgs

html = parse_page(url) # get the html for first page
links = get_links(html) # get all relevant image links
print links
The very first image has the following HTML:
src="http://37.media.tumblr.com/tumblr_m9q9feJcxl1qi02clo3_500.jpg" alt="">
I would like to know why it doesn't grab this image (and also misses most of the others).
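As a sanity check, the pattern can be run directly against the snippet quoted above, independent of the download step. This is a minimal sketch written for Python 3; the variable names `snippet`, `pattern`, and `matches` are introduced here for illustration only, and it says nothing about what `urlopen` actually returns for the page.

```python
import re

# Snippet copied from the question; the rest of the page markup is not shown here.
snippet = 'src="http://37.media.tumblr.com/tumblr_m9q9feJcxl1qi02clo3_500.jpg" alt="">'

# Same pattern as in get_links(), written as a raw string.
pattern = r'src."(.*?500.jpg)'

matches = re.findall(pattern, snippet, re.IGNORECASE)
print(matches)
```

If this finds the URL, the pattern itself handles that tag, so it may be worth printing the downloaded `html` to check whether it contains the same markup as the browser view.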
Upvotes: 0
Views: 174
Reputation: 70732
Consider using BeautifulSoup to do this:
>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> import re
>>> page = urlopen('http://bassrx.tumblr.com/tagged/tt')
>>> soup = BeautifulSoup(page.read())
>>> [x['src'] for x in soup.find_all('img', {'src': re.compile(r'500\.jpg$')})]
Output:
[
u'http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg',
u'http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg',
u'http://38.media.tumblr.com/c84fce183b6220eba854ce8933a13110/tumblr_n3lxgtqp7K1qi02clo1_500.jpg'
]
If you want the entire image tag, use the following:
>>> soup.find_all('img', {'src': re.compile(r'500\.jpg$')})
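If installing BeautifulSoup is not an option, the same idea (parse the tags, then filter the `src` attribute with a regex) can be sketched with the standard library's `html.parser`. This swaps in a different parser from the one used above and is written for Python 3. The class name `ImgSrcCollector`, its attributes, and the `sample_html` markup are all made up for illustration; only the first image URL is taken from the question.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src of every <img> tag whose src matches a regex."""

    def __init__(self, src_regex):
        super().__init__()
        self.src_pattern = re.compile(src_regex)
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if tag != 'img':
            return
        src = dict(attrs).get('src')
        if src and self.src_pattern.search(src):
            self.sources.append(src)


# Hypothetical sample markup standing in for the downloaded page.
sample_html = '''
<div><img src="http://37.media.tumblr.com/tumblr_m9q9feJcxl1qi02clo3_500.jpg" alt=""></div>
<div><img src="http://example.com/avatar_64.png" alt=""></div>
<div><img src="http://example.com/photo_1280.jpg" alt=""></div>
'''

collector = ImgSrcCollector(r'500\.jpg$')
collector.feed(sample_html)
print(collector.sources)
```

Filtering on the parsed attribute value rather than on the raw page text means the match does not depend on how the `src` attribute happens to be quoted or spaced in the markup.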
Upvotes: 1