Silka

Reputation: 304

How to print Google Search results properly with bs4?

I have working code that prints the search titles first and then the URLs, but it prints a lot of URLs between the website titles. How can I print them in the format below and avoid printing the same URLs 10 times for each title:

1) Title url
2) Title url
and so on... 

My code:

search = input("Search:")

page = requests.get(f"https://www.google.com/search?q=" + search)

soup = BeautifulSoup(page.content, "html5lib")

links = soup.findAll("a")

heading_object = soup.find_all('h3')

for info in heading_object:
    x = info.getText()
    print(x)
    for link in links:
        link_href = link.get('href')
        if "url?q=" in link_href:
            y = (link.get('href').split("?q=")[1].split("&sa=U")[0])
            print(y)

Upvotes: -1

Views: 366

Answers (2)

Dmitriy Zub

Reputation: 1724

You're looking for this:

for result in soup.select('.yuRUbf'):
  title = result.select_one('.DKV0Md').text
  url = result.a['href']
  print(f'{title}, {url}\n') # prints TITLE, URL followed by a new line.

If you're using an f-string, the appropriate way to use it is like so:

page = requests.get(f"https://www.google.com/search?q=" + search) # not proper f-string
page = requests.get(f"https://www.google.com/search?q={search}")  # proper f-string
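Separately from the f-string style, a raw search term dropped into the URL is not escaped, so spaces or characters such as & can change the query. A minimal sketch of building the URL with the standard library follows; build_search_url and BASE_URL are hypothetical names used only for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.google.com/search"


def build_search_url(query):
    """Return the search URL with the query safely URL-encoded."""
    # urlencode escapes spaces, '&', '+' and other reserved characters.
    query_string = urlencode({"q": query})
    return f"{BASE_URL}?{query_string}"


print(build_search_url("python memes"))
```

Passing a params dict to requests.get, as in the full code in this answer, has the same effect without building the string by hand.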

Code:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
  'User-agent':
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "python memes",
  "hl": "en"
}

soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')

for result in soup.select('.yuRUbf'):
  title = result.select_one('.DKV0Md').text
  url = result.a['href']
  print(f'{title}, {url}\n')

--------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/

ML Memes (@python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en

28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''

Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.

One of the differences is that you only need to iterate over JSON rather than figuring out how to scrape stuff.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google",
  "q": "python memes",
  "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  title = result['title']
  url = result['link']
  print(f'{title}, {url}\n')

-------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/

ML Memes (@python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en

28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''

Disclaimer: I work for SerpApi.

Upvotes: 0

furas

Reputation: 142889

If you get the titles and links separately, you can use zip() to group them in pairs:

for info, link in zip(heading_object, links):
    info = info.getText()

    link = link.get('href') or ''  # some <a> tags have no href attribute
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]

    print(info, link)

But this can go wrong when a title or link is missing from the page, because zip() then creates wrong pairs: it pairs a title with the link that belongs to the next element. It is better to search for the elements that contain both the title and the link, and inside each element search for a single title and a single link to create the pair. If a title or link is missing, you can substitute a default value, so no wrong pairs are created.
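A rough sketch of that container-based approach is shown here. The markup, the div.result selector, and the names clean_href and extract_pairs are assumptions made up for this example. Real Google HTML differs and changes often, so the selector would need adjusting.

```python
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a results page (not real Google HTML).
SAMPLE_HTML = """
<div class="result"><a href="/url?q=https://example.com/a&amp;sa=U"><h3>First title</h3></a></div>
<div class="result"><a href="/url?q=https://example.com/b&amp;sa=U"></a></div>
<div class="result"><h3>Third title</h3></div>
"""


def clean_href(href):
    """Return the target of a '/url?q=...' redirect link, else href unchanged."""
    query = parse_qs(urlparse(href).query)
    return query["q"][0] if "q" in query else href


def extract_pairs(html, container_selector="div.result"):
    """Build one (title, url) pair per container, with defaults for gaps."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for block in soup.select(container_selector):
        heading = block.find("h3")            # single title inside this block
        anchor = block.find("a", href=True)   # single link inside this block
        title = heading.get_text(strip=True) if heading else "(no title)"
        url = clean_href(anchor["href"]) if anchor else "(no link)"
        pairs.append((title, url))
    return pairs


for number, (title, url) in enumerate(extract_pairs(SAMPLE_HTML), start=1):
    print(f"{number}) {title} {url}")
```

Because each pair is built inside one container, a missing title or link only affects its own row instead of shifting every later pair.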

Upvotes: 1
