HStoltz

Reputation: 183

Python: scraping google results for websites' main URL and title

I am trying to scrape a given number of results from a Google search, but so far I have come across two problems. One is that I don't know how to join the URLs and the titles inside the same loop, so they can be shown together in the format:

(Title)
(Website URL)
(---------)
(Title)
(Website URL)
(---------)

I somehow managed to achieve this format, but the loop runs many more times than it should, instead of just showing the top 10 results. I believe it has something to do with how I structured the loops to work together, but I don't know how to avoid this.

The other problem is that I want to obtain both the main URL and the title of each website within the search results, but while I managed to get the right titles, I seem to be getting many links coming from the same website instead of only the main URL. For instance, if I search for "data science", the second or third title shown is from Coursera, while the link is from Wikipedia. I only want the main URL, so that the title matches the website URL. How do I get it?

Any input will be greatly appreciated.

import requests
from bs4 import BeautifulSoup
import re

query = "data science"
search = query.replace(' ', '+')
results = 10
url = (f"https://www.google.com/search?q={search}&num={results}")

requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
soup_title = BeautifulSoup(requests_results.text,"html.parser")
links = soup_link.find_all("a")
heading_object=soup_title.find_all( 'h3' )

for link in links:
  for info in heading_object:
    get_title = info.getText()
    link_href = link.get('href')
    if "url?q=" in link_href and not "webcache" in link_href:
      print(get_title)
      print(link.get('href').split("?q=")[1].split("&sa=U")[0])
      print("------")

Upvotes: 0

Views: 5178

Answers (2)

Dmitriy Zub

Reputation: 1724

Try passing requests params as a dict; it's more readable, e.g.:

params = {
  "q": "fus ro dah", 
  "hl": "en",
  "gl": "us",
  "num": "100"
}

requests.get('https://www.google.com/search', params=params)
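
requests will URL-encode the values and build the final URL for you, e.g. .../search?q=fus+ro+dah&hl=en&gl=us&num=100.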

Make sure you're using request headers and passing a user-agent to act like a real user visit. Otherwise Google will eventually block your request, because the default requests user-agent is python-requests. Check what your user-agent is.

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
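
If you're not sure what your script currently sends, a quick way to check is to hit httpbin.org, which simply echoes the request headers back:

import requests

# httpbin.org returns the headers it received; without a custom
# user-agent this prints something like "python-requests/2.x.x"
print(requests.get('https://httpbin.org/headers').json()['headers']['User-Agent'])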

You don't need to create several soups (BeautifulSoup() objects); create only one instead and call it whenever it's needed. See the CSS selectors reference.

soup = BeautifulSoup(html.text, 'YOUR PARSER OF CHOICE') # try to use 'lxml', it's one of the fastest

# call it
soup.select()
soup.find_all()
soup.a.tag_parent
soup.p.next_element
for i in soup.select('css_selector'):
    some_variable = i.select_one('css_selector')

Code and full example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  'q': 'data science',
  'hl': 'en',
  'num': '100'
}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']
  displayed_link = result.select_one('.TbwUpd.NJjxre').text
  try:
    snippet = result.select_one('#rso .lyLwlc').text
  except AttributeError:
    snippet = None

  print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
  print('---------------')

'''
Data Science Specialization - Coursera
https://www.coursera.org/specializations/jhu-data-science
https://www.coursera.org › ... › Data Analysis
Offered by Johns Hopkins University. Launch Your Career in Data Science. A ten-course introduction to data science, developed and taught by .
---------------
'''

Alternatively, you can do the same thing using Google Organic Results API from SerpAPI. It's a paid API with a free plan.

The main difference is that you only need to iterate over structured JSON and pick out the data you want, without figuring out how to select certain elements and extract data from them, without bypassing Google's blocks if they appear, and without dealing with JavaScript-heavy websites, e.g. Google Maps.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"), # serpapi API key
  "engine": "google",              # search engine
  "q": "data science",             # search query
  "hl": "en"                       # language of the search
}

search = GoogleSearch(params)      # where data extraction happens
results = search.get_dict()        # JSON -> Python dictionary

for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    snippet = result['snippet']

    print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")
    print('---------------')

'''
Data science - Wikipedia
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org › wiki › Data_science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured ...
---------------
'''

Disclaimer, I work for SerpApi.

Upvotes: 2

Jerven Clark

Reputation: 1219

The length of your links list doesn't match your heading_object list, so pairing them in nested loops multiplies the output. I think it's best to filter further than just "a".

Editing your solution, you can loop through links like this:

import requests
from bs4 import BeautifulSoup

query = "data science"
search = query.replace(' ', '+')
results = 10
url = f"https://www.google.com/search?q={search}&num={results}"

requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
links = soup_link.find_all("a")

for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and "webcache" not in link_href:
        title = link.find_all('h3')
        if len(title) > 0:
            print(link_href.split("?q=")[1].split("&sa=U")[0])
            print(title[0].getText())
            print("------")

Instead of keeping two lists for headings and links, we can get the heading directly from the link object by doing another find_all('h3') inside it. Some links match the url?q= format but are not part of the actual results you want to display, like the expanding accordion for related searches, so we need to filter those out too. We can do that by checking whether they contain an "h3" heading, which is why we have len(title) > 0.
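
To address the second part of the question (getting only the site's main URL), one option is to parse the href with the standard library instead of chaining split() calls. A minimal sketch, assuming the same /url?q=... redirect format Google returns here; the helper names target_url and main_url are my own:

from urllib.parse import urlparse, parse_qs

def target_url(href):
    # pull the real destination out of Google's "/url?q=..." redirect
    return parse_qs(urlparse(href).query).get('q', [None])[0]

def main_url(href):
    # reduce a deep link to scheme + domain, e.g.
    # https://en.wikipedia.org/wiki/Data_science -> https://en.wikipedia.org
    parts = urlparse(target_url(href))
    return f"{parts.scheme}://{parts.netloc}"

Replacing the split() chain in the loop above with main_url(link_href) would then print just the domain-level URL, so the title and the website URL line up the way the question asks.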

Upvotes: 3
