Reputation: 767
I am trying to use BeautifulSoup to extract image URLs from Bing image search results.
This part behaves as expected:
from bs4 import BeautifulSoup
import re
import requests

def get_soup(url):
    # No parser is named here, so bs4 falls back to whichever one is installed
    return BeautifulSoup(requests.get(url).text)

query = 'doggy'
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)
The following, though, returns an empty list rather than a list of URLs:
bimg = re.compile("mm.bing.net")
img_links = soup.find_all("img", {"src": bimg})
print(img_links)
When I print soup.prettify(), I can see the URLs I want. It looks like all of the img tags may be sitting inside a script element. Could that be why BS4 is not seeing them?
Here is some of the prettify output that contains the URLs.
<script type="text/javascript">
//<![CDATA[
var t = '<div class="iol_fp" id="iol_bg"></div><div id="iol_ph"></div><div id="iol_dp"><button id="iol_cls" title="Close"></button><div id="iol_ip"><div id="iol_imp">
<div id="iol_imw"></div><div class="iol_nav" id="iol_navl"></div><div class="iol_nav" id="iol_navr"></div></div><div id="iol_mdb"><span class="iol_mdi" id="iol_md"><span id="iol_mdis"></span><span id="iol_sep">·</span><a id="iol_mdit"></a></span>
<span id="iol_bspan"><button class="iol_mdi" id="iol_pin" href="#" title="Pin to Pinterest"></button><button class="iol_mdi" id="iol_vl" href="#">Show larger</button><button class="iol_mdi" id="iol_vs" href="#">Show smaller</button>
<button class="iol_mdi" id="iol_ss" href="#">Play All</button><button class="iol_mdi" id="iol_sse" href="#">Pause</button></span></div><div id="iol_fsw"><div id="iol_fscb"></div><div id="iol_fsc"></div></div></div><div id="iol_sp"><div id="iol_rs">
<div id="iol_rst">ALSO CONSIDER</div><span id="iol_rsp"><div><div class="iol_rsc"><a href="/images/search?q=Doggy+GIF+Style+1+2+3&Form=IQFRDR" class="iol_rsi" title="Search for: Doggy GIF Style 1 2 3" h="ID=images,5187.2">
<img src="http://ts4.mm.bing.net/th?q=Doggy+GIF+Style+1+2+3&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq">Doggy<br/><strong>GIF Style 1 2 3</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Puppies&Form=IQFRDR" class="iol_rsi" title="Search for: Puppies" h="ID=images,5189.2"><img src="http://ts1.mm.bing.net/th?q=Puppies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Puppies</strong></span></a></div><div class="iol_rsc"><a href="/images/search?q=Funny+Doggies&Form=IQFRDR" class="iol_rsi" title="Search for: Funny Doggies" h="ID=images,5191.2">
<img src="http://ts4.mm.bing.net/th?q=Funny+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Funny</strong><br/>Doggies</span></a></div><div class="iol_rsc"><a href="/images/search?q=Doggie+Dentures&Form=IQFRDR" class="iol_rsi" title="Search for: Doggie Dentures" h="ID=images,5193.2">
<img src="http://ts1.mm.bing.net/th?q=Doggie+Dentures&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Doggie Dentures</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Cute+Doggies&Form=IQFRDR" class="iol_rsi" title="Search for: Cute Doggies" h="ID=images,5195.2"><img src="http://ts3.mm.bing.net/th?q=Cute+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Cute</strong><br/>Doggies
Any help would be much appreciated!
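The script hypothesis in the question can be checked offline with a minimal sketch. PAGE below is a made-up stand-in, not Bing's actual markup, and "html.parser" is chosen only because it ships with Python. It shows that markup embedded in a JavaScript string is kept as text, and that re-parsing that text is one way to reach the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the only <img> lives inside a JavaScript string literal.
PAGE = """
<html><body>
<script type="text/javascript">
var t = '<div class="iol_rsc"><img src="http://ts1.mm.bing.net/th?q=Puppies"/></div>';
</script>
</body></html>
"""

def img_srcs(markup):
    """Return the src of every <img> the parser exposes as a real tag."""
    soup = BeautifulSoup(markup, "html.parser")
    return [img.get("src") for img in soup.find_all("img")]

# The script body is stored as one opaque string, so no <img> tag is found.
outer = img_srcs(PAGE)

# Re-parsing the script's text as HTML exposes the embedded tags.
script_text = str(BeautifulSoup(PAGE, "html.parser").script.string)
inner = img_srcs(script_text)

print(outer)
print(inner)
```

If the first list is empty and the second is not, the tags in question exist only as text inside the script.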
Upvotes: 0
Views: 1304
Reputation: 767
@alecxe was on the right track: this was an issue with HTML5. I installed the html5lib library, and the following code resolved the issue:
from bs4 import BeautifulSoup
import requests
import html5lib  # not used directly; importing it confirms the parser is installed

def get_soup(url):
    # Name the parser explicitly so bs4 builds the tree with html5lib
    return BeautifulSoup(requests.get(url).text, 'html5lib')

query = 'doggy'
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)
Thanks for the help.
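Since the fix came down to the second argument of BeautifulSoup, here is a rough sketch for comparing parsers on the same input. MARKUP and the helper count_imgs are made up for illustration. Parsers that are not installed are reported as None instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, slightly broken markup to feed to each parser.
MARKUP = '<div><img src="http://ts1.mm.bing.net/th?q=Puppies"><p>unclosed'

def count_imgs(markup, parser):
    """Count <img> tags seen by the given parser, or None if it is not installed."""
    try:
        soup = BeautifulSoup(markup, parser)
    except FeatureNotFound:
        return None
    return len(soup.find_all("img"))

results = {name: count_imgs(MARKUP, name)
           for name in ("html.parser", "lxml", "html5lib")}
for name, count in results.items():
    print(name, count)
```

When no parser is named, bs4 picks the best one it finds installed, so the same script can behave differently on two machines. Naming the parser removes that variable.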
Upvotes: 1
Reputation: 12747
from bs4 import BeautifulSoup
import requests
import re

def get_soup(url):
    # .content returns raw bytes, leaving encoding detection to BeautifulSoup
    html = requests.get(url).content
    return BeautifulSoup(html)

query = 'doggy'
url = "http://www.bing.com/images/search?q=" + query + "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3"
soup = get_soup(url)
bimg = re.compile('.*mm.bing.net.*')
img_links = soup.find_all("img", {'src': bimg})
for link in img_links:
    print(link)
I tweaked your regex a bit. The code above prints:
<img src="http://ts3.mm.bing.net/th?q=Rabbit&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cow&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Tiger&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Elephant&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Fish&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Fox&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Animal&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Chicken+Bird&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Domestic+Sheep&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Giraffe&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Puppy&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Dolphin&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Pet&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Baby+Birds&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Labrador+Retriever&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Chihuahua&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cat&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Lion&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Zebra&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Bulldog&w=50&h=50&c=1&pid=1.7&mkt=en-CA&adlt=moderate&t=1"/>
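To see what the regex change does on its own, here is a small offline sketch on a made-up snippet (HTML and the helper srcs are assumptions for illustration). BeautifulSoup applies a compiled pattern with its search() method, so a pattern does not need leading or trailing .* to match in the middle of an attribute value.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: two Bing thumbnails and one unrelated image.
HTML = """
<img src="http://ts3.mm.bing.net/th?q=Rabbit"/>
<img src="http://example.com/logo.png"/>
<img src="http://ts1.mm.bing.net/th?q=Cow"/>
"""

soup = BeautifulSoup(HTML, "html.parser")

def srcs(pattern):
    """Return the src values of <img> tags whose src matches the pattern."""
    return [img["src"] for img in soup.find_all("img", {"src": re.compile(pattern)})]

original = srcs("mm.bing.net")        # pattern from the question
tweaked = srcs(".*mm.bing.net.*")     # pattern from this answer
escaped = srcs(r"mm\.bing\.net")      # dots escaped to match literally

print(original == tweaked == escaped)
```

On this snippet all three patterns select the same two tags, which suggests that any difference in results on the live page may come from something other than the pattern, such as the installed parser.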
Upvotes: 0
Reputation: 283
import urllib
from bs4 import BeautifulSoup

url = "http://www.bing.com/images/search?q=%s&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3" % 'doggy'
# urllib.urlopen is Python 2; on Python 3 use urllib.request.urlopen instead
html_page = urllib.urlopen(url)
soup = BeautifulSoup(html_page)

# Collect the src of every img tag, keeping only those that contain "http://"
img_links = []
for link in soup.find_all("img"):
    src = str(link.get('src'))
    if "http://" in src:
        img_links.append(src)
Try this. The links should end up in the list img_links.
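The filtering step can be tried on its own without any network access. This is a sketch written for Python 3. The helper name keep_absolute_http and the sample list are made up, and accepting https as well as http is an assumption that goes beyond the "http://" check above.

```python
from urllib.parse import urlparse

def keep_absolute_http(srcs):
    """Return only the entries that are absolute http(s) URLs, preserving order."""
    kept = []
    for src in srcs:
        if src is None:
            continue  # <img> tags without a src attribute yield None
        if urlparse(src).scheme in ("http", "https"):
            kept.append(src)
    return kept

# Hypothetical src values, as soup.find_all("img") might produce them.
sample = [
    "http://ts1.mm.bing.net/th?q=Cat",
    None,
    "/sa/simg/placeholder.png",
    "data:image/gif;base64,R0lGOD",
    "https://example.com/a.png",
]
filtered = keep_absolute_http(sample)
print(filtered)
```

Building a new list avoids deleting items from a list while looping over its indices, which can skip elements.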
Upvotes: 0