jamesbev

Reputation: 767

BeautifulSoup not grabbing 'img src' as expected

I'm trying to use BeautifulSoup to extract image URLs from Bing image search results.

The setup below behaves as expected:

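from bs4 import BeautifulSoup
import re
import requests

def get_soup(url):
    # No parser is named here, so BeautifulSoup picks whichever one is installed
    return BeautifulSoup(requests.get(url).text)

query = 'doggy'
# Parentheses let the concatenation span two lines
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)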

The following, though, returns an empty list rather than a list of URLs:

bimg = re.compile("mm.bing.net")
img_links = soup.find_all("img", {"src": bimg})
print(img_links)

When I print soup.prettify() I can see the URLs I want. It looks like all the img tags may be sitting inside a script element. Could that be why BS4 isn't seeing them?

Here is some of the prettify output that contains the URLs.

<script type="text/javascript">
  //<![CDATA[
var t = '<div class="iol_fp" id="iol_bg"></div><div id="iol_ph"></div><div id="iol_dp"><button id="iol_cls" title="Close"></button><div id="iol_ip"><div id="iol_imp">
<div id="iol_imw"></div><div class="iol_nav" id="iol_navl"></div><div class="iol_nav" id="iol_navr"></div></div><div id="iol_mdb"><span class="iol_mdi" id="iol_md"><span id="iol_mdis"></span><span id="iol_sep">·</span><a id="iol_mdit"></a></span>
<span id="iol_bspan"><button class="iol_mdi" id="iol_pin" href="#" title="Pin to Pinterest"></button><button class="iol_mdi" id="iol_vl" href="#">Show larger</button><button class="iol_mdi" id="iol_vs" href="#">Show smaller</button>
<button class="iol_mdi" id="iol_ss" href="#">Play All</button><button class="iol_mdi" id="iol_sse" href="#">Pause</button></span></div><div id="iol_fsw"><div id="iol_fscb"></div><div id="iol_fsc"></div></div></div><div id="iol_sp"><div id="iol_rs">
<div id="iol_rst">ALSO CONSIDER</div><span id="iol_rsp"><div><div class="iol_rsc"><a href="/images/search?q=Doggy+GIF+Style+1+2+3&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Doggy GIF Style 1 2 3" h="ID=images,5187.2">
<img src="http://ts4.mm.bing.net/th?q=Doggy+GIF+Style+1+2+3&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq">Doggy<br/><strong>GIF Style 1 2 3</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Puppies&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Puppies" h="ID=images,5189.2"><img src="http://ts1.mm.bing.net/th?q=Puppies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Puppies</strong></span></a></div><div class="iol_rsc"><a href="/images/search?q=Funny+Doggies&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Funny Doggies" h="ID=images,5191.2">
<img src="http://ts4.mm.bing.net/th?q=Funny+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Funny</strong><br/>Doggies</span></a></div><div class="iol_rsc"><a href="/images/search?q=Doggie+Dentures&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Doggie Dentures" h="ID=images,5193.2">
<img src="http://ts1.mm.bing.net/th?q=Doggie+Dentures&w=50&h=50&c=1&pid=1.7&adlt=moderate"/><span class="iol_rsiq"><strong>Doggie Dentures</strong></span></a></div><div class="iol_rsc">
<a href="/images/search?q=Cute+Doggies&amp;Form=IQFRDR" class="iol_rsi" title="Search for: Cute Doggies" h="ID=images,5195.2"><img src="http://ts3.mm.bing.net/th?q=Cute+Doggies&w=50&h=50&c=1&pid=1.7&adlt=moderate"/>
<span class="iol_rsiq"><strong>Cute</strong><br/>Doggies

Any help would be much appreciated!

Upvotes: 0

Views: 1304

Answers (3)

jamesbev

Reputation: 767

@alecxe was on the right track: this was an issue with HTML5. I installed the html5lib library and told BeautifulSoup to use it as the parser. The following code resolved the issue:

from bs4 import BeautifulSoup
import requests
import html5lib  # must be installed; BeautifulSoup selects it by the name 'html5lib'

def get_soup(url):
    return BeautifulSoup(requests.get(url).text, 'html5lib')

query = 'doggy'
# Parentheses let the concatenation span two lines
url = ("http://www.bing.com/images/search?q=" + query +
       "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3")
soup = get_soup(url)
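
Separately from the parser choice, markup that sits inside a script element (like the prettify output in the question) is normally treated as script text rather than as elements, so find_all("img") would not return it. As a rough fallback sketch, assuming the URLs appear literally in the raw HTML, a regular expression can pull them out directly. The helper name extract_bing_thumbs, the pattern, and the sample string are made up for illustration.

```python
import re

# Hypothetical helper: scan raw HTML text for thumbnail URLs on *.mm.bing.net.
# The final character class stops at quotes, whitespace, and angle brackets.
THUMB_RE = re.compile(r'https?://[\w.-]*mm\.bing\.net/[^\s"\'<>]*')

def extract_bing_thumbs(html_text):
    """Return every matching URL found anywhere in the text, in order."""
    return THUMB_RE.findall(html_text)

# Made-up sample mimicking img tags embedded in a JavaScript string
sample = (
    "<script>var t = '<img src=\"http://ts1.mm.bing.net/th?q=Puppies&w=50\"/>"
    "<img src=\"http://ts4.mm.bing.net/th?q=Funny+Doggies&w=50\"/>';</script>"
)
for thumb in extract_bing_thumbs(sample):
    print(thumb)
```

With real pages this would be called as extract_bing_thumbs(requests.get(url).text). It finds URLs wherever they occur, so it cannot tell an img src from any other attribute.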

Thanks for the help.

Upvotes: 1

TankorSmash

Reputation: 12747

from bs4 import BeautifulSoup
import requests
import re

def get_soup(url):
    request = requests.get(url).content
    return BeautifulSoup(request)

query = 'doggy'
url = "http://www.bing.com/images/search?q=" + query + "&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3"
soup = get_soup(url)
bimg = re.compile('.*mm.bing.net.*')  # unescaped dots match any character
img_links = soup.find_all("img", {'src': bimg})
for link in img_links:
    print(link)

I tweaked your regex a bit. This prints:

<img src="http://ts3.mm.bing.net/th?q=Rabbit&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cow&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Tiger&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Elephant&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Fish&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Fox&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Animal&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Chicken+Bird&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Domestic+Sheep&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Giraffe&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Puppy&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Dolphin&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Pet&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Baby+Birds&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts4.mm.bing.net/th?q=Labrador+Retriever&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Chihuahua&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Cat&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts3.mm.bing.net/th?q=Lion&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts1.mm.bing.net/th?q=Zebra&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
<img src="http://ts2.mm.bing.net/th?q=Bulldog&amp;w=50&amp;h=50&amp;c=1&amp;pid=1.7&amp;mkt=en-CA&amp;adlt=moderate&amp;t=1"/>
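
On the regex itself: the Beautiful Soup documentation describes compiled patterns as being applied with the pattern's search() method. Assuming that behavior, the leading and trailing .* should not change which tags match. The standard-library sketch below illustrates the difference, using a shortened URL modeled on the listing above and a made-up look-alike host.

```python
import re

src = "http://ts3.mm.bing.net/th?q=Rabbit&w=50&h=50"
lookalike = "http://mmxbingxnet.example/th"  # made-up host for illustration

plain = re.compile("mm.bing.net")
padded = re.compile(".*mm.bing.net.*")
escaped = re.compile(r"mm\.bing\.net")

# search() scans the whole string, so padding with .* adds nothing
print(bool(plain.search(src)), bool(padded.search(src)))

# match() anchors at the start of the string, which is where .* would matter
print(bool(plain.match(src)), bool(padded.match(src)))

# Unescaped dots match any character; escaped dots match only "."
print(bool(plain.search(lookalike)), bool(escaped.search(lookalike)))
```

If the intent is to match the literal host name, the escaped form is the stricter choice.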

Upvotes: 0

Zach Gates

Reputation: 283

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2
from bs4 import BeautifulSoup

url = "http://www.bing.com/images/search?q=%s&qft=+filterui:color2-bw+filterui:imagesize-large&FORM=R5IR3" % 'doggy'

html_page = urlopen(url)
soup = BeautifulSoup(html_page)

# Collect each img tag's src, keeping only those that contain "http://"
img_links = []
for link in soup.find_all("img"):
    src = str(link.get('src'))
    if "http://" in src:
        img_links.append(src)

Try this.

The links should be in the list img_links.
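
Note that a substring test for "http://" skips https:// links and accepts any string that merely contains that text. A slightly stricter sketch, assuming Python 3 and that only absolute http(s) URLs are wanted, checks the parsed scheme instead. The function name keep_absolute and the sample values are hypothetical.

```python
from urllib.parse import urlparse

def keep_absolute(srcs):
    """Keep only values whose URL scheme is http or https."""
    return [s for s in srcs if s and urlparse(s).scheme in ("http", "https")]

# Made-up src values, including None for an img tag without a src attribute
candidates = [
    "http://ts1.mm.bing.net/th?q=Puppies",
    "/sa/simg/sw_mg_l_4d.png",
    None,
    "data:image/gif;base64,R0lGOD",
    "https://example.com/a.png",
]
print(keep_absolute(candidates))
```

With BeautifulSoup this could be fed with [img.get('src') for img in soup.find_all("img")].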

Upvotes: 0
