kowal666

Reputation: 61

Google Scraping href values

I have a problem finding href values with BeautifulSoup:

from urllib import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("https://www.google.pl/search?q=sprz%C4%99t+dla+graczy&client=ubuntu&ei=4ypXWsi_BcLZwQKGroW4Bg&start=0&sa=N&biw=741&bih=624")
bsObj = BeautifulSoup(html)
for link in bsObj.find("h3", {"class":"r"}).findAll("a"):
  if 'href' in link.attrs:
    print(link.attrs['href'])

I keep getting this error:

AttributeError: 'NoneType' object has no attribute 'findAll'

Upvotes: 1

Views: 326

Answers (1)

t.m.adam

Reputation: 15376

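You'll have to change the User-Agent string to something other than urllib's default user agent. With the default, the response does not contain the h3 element you are searching for, so find returns None and the findAll call fails.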

from urllib2 import urlopen, Request
from bs4 import BeautifulSoup

url = "https://www.google.pl/search?q=sprz%C4%99t+dla+graczy&client=ubuntu&ei=4ypXWsi_BcLZwQKGroW4Bg&start=0&sa=N&biw=741&bih=624"
html = urlopen(Request(url, headers={'User-Agent':'Mozilla/5'})).read()
bsObj = BeautifulSoup(html, 'html.parser')

for link in bsObj.find("h3", {"class":"r"}).findAll("a", href=True):
    print(link['href'])
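
The snippet above targets Python 2, where the urllib2 module exists. On Python 3 the same classes live in urllib.request. Here is a minimal sketch of the same request, assuming Python 3. The helper names `build_request` and `fetch` are made up for illustration, and whether Google keeps accepting this User-Agent value is not guaranteed:

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5"):
    """Return a Request that sends a custom User-Agent instead of urllib's default."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page body as bytes (this performs a network call)."""
    with urlopen(build_request(url)) as response:
        return response.read()
```

The bytes returned by `fetch(url)` can be passed to BeautifulSoup exactly as in the code above.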

Also note that find returns only the first matching h3 element, so this expression selects only the first result's link. If you want to select all the links in the page, use the following expression:

links = bsObj.select("h3.r a[href]")
for link in links:
    print(link['href'])
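
The AttributeError in the question comes from calling findAll on the None that find returns when nothing matches. Since select returns an empty list in that case, wrapping it in a function gives a version that cannot fail that way. This is only a sketch: `result_links` and the sample markup are hypothetical, and the h3.r selector is taken from the question and may not match the markup Google currently serves.

```python
from bs4 import BeautifulSoup


def result_links(html):
    """Return the href of every <a> inside an <h3 class="r">, or [] if none match."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("h3.r a[href]")]


# Hypothetical markup standing in for a fetched results page.
sample = (
    '<h3 class="r"><a href="/url?q=one">1</a></h3>'
    '<h3 class="r"><a href="/url?q=two">2</a></h3>'
)
print(result_links(sample))
print(result_links("<p>no results here</p>"))
```

An empty list then signals that the page did not contain the expected elements, which is easier to check than catching the exception.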

Upvotes: 4
