ablanch5

Reputation: 338

not getting correct url beautifulsoup python

I am trying to web-scrape Google search results using Python and BeautifulSoup. In my first program I'm just trying to get all the links on the search results page. Ultimately I want to follow those links to other websites and then scrape those websites. The problem is that the links my program gives me do not point to the correct URL. For example, the first result URL after searching "what is python" on Google is 'https://www.python.org/doc/essays/blurb/', but my program gives me '/url?q=https://www.python.org/doc/essays/blurb/&sa=U&ved=0ahUKEwirv7mZzNnbAhXD5YMKHdl0AFsQFggUMAA&usg=AOvVaw3Q2RD0gl-X3BiEJ-5HIxmF'.

Reviewing the BeautifulSoup documentation, I am expecting output similar to their example:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Instead I am getting a preceding '/url?q=' and lots of unexpected characters after the website address. Can someone explain why I am not getting the expected output? Here is my code:

import requests
from bs4 import BeautifulSoup

search_item = 'what is python'
url = "https://www.google.ca/search?q=" + search_item

response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

for link in soup.find_all('a'):
    print(link.get('href'))
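
For what it's worth, the actual destination does appear to be embedded in the q parameter of that string; this snippet (just my own experimenting, not a fix) recovers it:

import urllib.parse

wrapped = '/url?q=https://www.python.org/doc/essays/blurb/&sa=U&ved=0ahUKEwirv7mZzNnbAhXD5YMKHdl0AFsQFggUMAA&usg=AOvVaw3Q2RD0gl-X3BiEJ-5HIxmF'
# the real destination is the value of the 'q' query parameter
query = urllib.parse.urlparse(wrapped).query
real_url = urllib.parse.parse_qs(query)['q'][0]
print(real_url)  # https://www.python.org/doc/essays/blurb/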

Upvotes: 0

Views: 161

Answers (2)

Dmitriy Zub

Reputation: 1724

It's because no user-agent was specified. The default requests user-agent is python-requests, so Google blocks the request because it can tell it came from a bot and not a "real" user visit. Passing a browser user-agent in the HTTP request headers fakes a real user visit.
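
For instance, a minimal sketch (the full code below does the same thing):

import requests

# identify as a regular browser instead of the default python-requests client
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get("https://www.google.com/search",
                        params={"q": "what is python"},
                        headers=headers)
print(response.status_code)  # should no longer be blocked outright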


Also, you're not pinpointing the links you're looking for with this code; it will extract all links from the HTML:

for link in soup.find_all('a'):
    print(link.get('href'))

Instead, you're looking for links from the organic results, e.g.:

# container with the needed data (title, link, snippet, displayed link, etc.)
for result in soup.select('.tF2Cxc'):
    # grabbing just the links from the container
    link = result.select_one('.yuRUbf a')['href']

Code:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "what does katana mean",  # query
    "gl": "us",                    # country to search from
    "hl": "en"                     # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
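
Note that CSS class names like .tF2Cxc and .yuRUbf are tied to Google's current markup and change periodically, so these selectors may need updating over time.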

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you only need to extract the data you want from structured JSON, rather than figuring out why certain things don't work properly.

Code to integrate:


import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])

Disclaimer: I work for SerpApi.

Upvotes: 1

ablanch5

Reputation: 338

I wanted to provide an update to this question. I found that by adding a header:

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/67.0.3396.99 Safari/537.36'}
r = requests.get(url, headers=headers)

Google provided me with the correct links and I did not have to do any manipulation of the string.
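
For completeness, here is my original script with just the header added (no other changes):

import requests
from bs4 import BeautifulSoup

search_item = 'what is python'
url = "https://www.google.ca/search?q=" + search_item

# identify as a regular browser instead of the default python-requests client
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/67.0.3396.99 Safari/537.36'}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "lxml")

for link in soup.find_all('a'):
    print(link.get('href'))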

Upvotes: 0
