Reputation: 2407
I am pretty new with BeautifulSoup
. I am trying to print image links from http://www.bing.com/images?q=owl:
redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
productDivs = soup.findAll('div', attrs={'class' : 'dg_u'})
for div in productDivs:
print div.find('a')['t1'] #works fine
print div.find('img')['src'] #This getting issue KeyError: 'src'
But this gives only title, not the image source Is there anything wrong?
Edit: I have edited my source, still could not get image url.
Upvotes: 2
Views: 100
Reputation: 6950
Bing is using some techniques to block automated scrapers. I tried to print
div.find('img')
and found that they are sending source in attribute names src2, so following should work -
div.find('img')['src2']
This is working for me. Hope it helps.
Upvotes: 1
Reputation: 473873
If you open up browser develop tools, you'll see that there is an additional async XHR request issued to the http://www.bing.com/images/async
endpoint which contains the image search results.
Which leads to the 3 main options you have:
simulate that XHR request in your code. You might want to use something more suitable for humans than urllib2
; see requests
module. This would be so called "low-level" approach, going down to the bare metal and web-site specific implementation which would make this option non-reliable, difficult, "heavy", error-prompt and fragile
automate a real browser using selenium
- stay on the high-level. In other words, you don't care how the results are retrieved, what requests are made, what javascript needs to be executed. You just wait for search results to appear and extract them.
use Bing Search API (this should probably be option #1)
Upvotes: 1