ablanch5

Reputation: 338

not getting correct url beautifulsoup python

I am trying to web-scrape Google search results using Python and BeautifulSoup. In my first program I'm just trying to get all the links on the search results page. Ultimately I want to follow those links to other websites and then scrape those websites. The problem is that the links my program gives me do not point to the correct URL. For example, the first result URL after searching "what is python" on Google is 'https://www.python.org/doc/essays/blurb/', but my program gives me '/url?q=https://www.python.org/doc/essays/blurb/&sa=U&ved=0ahUKEwirv7mZzNnbAhXD5YMKHdl0AFsQFggUMAA&usg=AOvVaw3Q2RD0gl-X3BiEJ-5HIxmF'.

Reviewing the BeautifulSoup documentation, I am expecting output similar to their example:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Instead I am getting a preceding '/url?q=' and lots of unexpected characters after the website address. Can someone explain why I am not getting the expected output? Here is my code:

import requests
from bs4 import BeautifulSoup

search_item = 'what is python'
url = "https://www.google.ca/search?q=" + search_item

response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

for link in soup.find_all('a'):
    print(link.get('href'))
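
For what it's worth, the actual destination does appear to be embedded in the q parameter of that string; this snippet (just my own experimenting, not a fix) recovers it:

import urllib.parse

wrapped = '/url?q=https://www.python.org/doc/essays/blurb/&sa=U&ved=0ahUKEwirv7mZzNnbAhXD5YMKHdl0AFsQFggUMAA&usg=AOvVaw3Q2RD0gl-X3BiEJ-5HIxmF'
# the real destination is the value of the 'q' query parameter
query = urllib.parse.urlparse(wrapped).query
real_url = urllib.parse.parse_qs(query)['q'][0]
print(real_url)  # https://www.python.org/doc/essays/blurb/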

Upvotes: 0

Views: 161

Answers (2)

Dmitriy Zub

Reputation: 1724

It's because no user-agent was specified. The default requests user-agent is python-requests, so Google blocks the request because it can tell it came from a bot and not a "real" user visit. Passing a browser user-agent in the HTTP request headers fakes a real user visit.
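
For instance, a minimal sketch (the full code below does the same thing):

import requests

# identify as a regular browser instead of the default python-requests client
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get("https://www.google.com/search",
                        params={"q": "what is python"},
                        headers=headers)
print(response.status_code)  # should no longer be blocked outright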


Also, you're not pinpointing the links you're looking for with this code; it will extract all links from the HTML:

for link in soup.find_all('a'):
    print(link.get('href'))

Instead, you're looking for links from the organic results, e.g.:

# container with the needed data (title, link, snippet, displayed link, etc.)
for result in soup.select('.tF2Cxc'):
    # grabbing just the links from the container
    link = result.select_one('.yuRUbf a')['href']

Code:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "what does katana mean",  # query
    "gl": "us",                    # country to search from
    "hl": "en"                     # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
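
Note that CSS class names like .tF2Cxc and .yuRUbf are tied to Google's current markup and change periodically, so these selectors may need updating over time.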

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to extract the data you want from structured JSON, rather than figuring out why certain things don't work properly.

Code to integrate:


import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])

Disclaimer: I work for SerpApi.

Upvotes: 1

ablanch5

Reputation: 338

I wanted to provide an update to this question. I found that by adding a header:

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/67.0.3396.99 Safari/537.36'}
r = requests.get(url, headers=headers)

Google provided me with the correct links and I did not have to do any manipulation of the string.
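
For completeness, here is my original script with just the header added (no other changes):

import requests
from bs4 import BeautifulSoup

search_item = 'what is python'
url = "https://www.google.ca/search?q=" + search_item

# identify as a regular browser instead of the default python-requests client
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/67.0.3396.99 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")

for link in soup.find_all('a'):
    print(link.get('href'))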

Upvotes: 0
