
Reputation: 121

How to access top five Google result links using Beautifulsoup

I want to access the top five (or any specified number) of result links from a Google search. Through research, I found and modified the following code.

import requests
from bs4 import BeautifulSoup

search = input("Search:")
page = requests.get("https://www.google.com/search?q=" + search)
soup = BeautifulSoup(page.content, "lxml")
links = soup.find("a")
print(links.get('href'))

This returns the first link on the page, which seems to be the Google images tab every time.

This is not quite what I want. For starters, I don't want links to any Google sites, just the results. Also, I want the first three, five, or any specified number of results.

How can I use Python to do this?

Thanks ahead of time!

Upvotes: 6

Views: 5828

Answers (5)

Dmitriy Zub

Reputation: 1724

If you're using find_all(), you can simply use the limit argument instead of list slicing, as myfashionhub suggested (if you're using select(), though, slicing is still needed):

soup.findAll('div', {'class': '_NId'})[:5]   # list slicing
↓
soup.findAll('div', {'class': '_NId'}, limit=5)  # limit argument
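
To see the difference, here's a small self-contained demo (the HTML is made up for illustration; it is not Google's markup):

from bs4 import BeautifulSoup

# Toy HTML with ten result-like divs (illustrative only, not Google's real markup)
html = "".join(
    f"<div class='item'><a href='https://example.com/{i}'>link {i}</a></div>"
    for i in range(10)
)
soup = BeautifulSoup(html, "lxml")

# find_all() can stop collecting matches early with limit=
print(len(soup.find_all("div", {"class": "item"}, limit=5)))  # 5

# select() has no limit parameter, so slice the full result list instead
print(len(soup.select("div.item")[:5]))  # 5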

As others mentioned in the answers, soup.find("a") grabs only the first <a> tag from the whole HTML. You're looking for something like this instead (see the CSS selectors reference, and the SelectorGadget extension to grab CSS selectors):

links = soup.find("a") # returns only the first <a> tag from the HTML

# select container with needed elements and grab each element in a loop
for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']

Make sure you're passing a user-agent, because the default requests user-agent is python-requests. Google can tell such a request comes from a bot rather than a "real" user visit, blocks it, and serves different HTML with some sort of error. Passing a user-agent fakes a real user visit by adding that information to the HTTP request headers.

I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines; it covers multiple solutions.

Pass user-agent in request headers:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "samurai cop what does katana mean",  # search query
  "gl": "us",   # country of the search
  "hl": "en",   # interface language
  "num": "100"  # number of results per page
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc')[:5]:
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']

  print(title, link, sep='\n')

Output:
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with picking the correct selectors or figuring out why certain things don't work as expected. Instead, you only need to iterate over structured JSON and grab the data you want.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "samurai cop what does katana mean",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
  print(result['title'])
  print(result['link'])

Output:
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''

Disclaimer: I work for SerpApi.

Upvotes: 0

Pedro Lobito

Reputation: 99001

You can use:

import requests
from bs4 import BeautifulSoup

search = input("Search:")
results = 100  # valid options: 10, 20, 30, 40, 50, and 100
page = requests.get(f"https://www.google.com/search?q={search}&num={results}")
soup = BeautifulSoup(page.content, "html5lib")
links = soup.findAll("a")
for link in links:
    link_href = link.get('href')
    # keep only result links (wrapped as /url?q=...) and skip Google cache links
    if link_href and "url?q=" in link_href and "webcache" not in link_href:
        print(link_href.split("?q=")[1].split("&sa=U")[0])
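
If you only want the top five (or any specified number of) links, a small variation of the loop above stops once enough have been printed:

limit = 5  # print only the first five result links
count = 0
for link in links:
    link_href = link.get('href')
    if link_href and "url?q=" in link_href and "webcache" not in link_href:
        print(link_href.split("?q=")[1].split("&sa=U")[0])
        count += 1
        if count == limit:
            break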

Google Search Demo

For duckduckgo.com use:

import requests
from bs4 import BeautifulSoup

search = input("Search:")
h = {
    "Host": "duckduckgo.com",
    "Origin": "https://duckduckgo.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0",
}
d = {"q": search}
page = requests.post("https://duckduckgo.com/html/", data=d, headers=h)
soup = BeautifulSoup(page.content, "html5lib")
links = soup.findAll("a", {"class": "result__a"})
for link in links:
    link_href = link.get('href')
    if "https://duckduckgo.com" not in link_href:
        print(link_href)

Upvotes: 9

Daniel Ocando

Reputation: 3784

You could try the code below:

import bs4, requests

headers = {'User-Agent':
           'MAKE A GOOGLE SEARCH FOR MY USER AGENT AND PASTE IT HERE'}
search = "test"
address = 'http://www.google.com/search?q=' + search
res = requests.get(address, headers=headers)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
links = soup.select('div.r a')

l = []  # empty list to collect only the result links

# Clean the soup by keeping only the information requested
for link in links:
  if "webcache.googleusercontent.com" in link.attrs["href"]:
    pass
  elif "#" in link.attrs["href"]:
    pass
  elif "/search?q=related:" in link.attrs["href"]:
    pass
  else:
    l.append(link.attrs["href"])

for href in l[:5]:  # slicing avoids an IndexError if fewer than 5 links remain
  print(href)

Make sure to replace the User-Agent placeholder with your own user agent string, as indicated in the code.

Upvotes: 0

Mandrax

Reputation: 11

An old question, but it may help someone later... you can specify the result offset with 'start' (a multiple of 10, one per results page) and iterate over it in a loop. Below is an example covering the first 200 results. Mind the string conversion.

s = 'AAPL'
for mypage in range(0, 200, 10):
    myurl = "http://www.google.com/search?q=" + s + "&start=" + str(mypage)
    # fetch and parse myurl here
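
As a minimal runnable sketch of that loop, reusing the /url?q= link format and user-agent advice from the other answers (Google's markup changes over time, so the parsing may need adjusting):

import requests
from bs4 import BeautifulSoup

s = 'AAPL'
headers = {"User-Agent": "Mozilla/5.0"}  # any browser-like user agent

for mypage in range(0, 200, 10):
    myurl = "http://www.google.com/search?q=" + s + "&start=" + str(mypage)
    page = requests.get(myurl, headers=headers)
    soup = BeautifulSoup(page.text, "html.parser")
    for link in soup.findAll("a"):
        href = link.get('href')
        # result links are wrapped as /url?q=<target>&sa=U... (see the other answers)
        if href and "url?q=" in href and "webcache" not in href:
            print(href.split("?q=")[1].split("&sa=U")[0])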

Bonus: note that you can also specify the language with 'hl': en (English), fr (French), etc.

myurl = "http://www.google.com/search?hl=fr&q=" + s + "&start=" + str(mypage)

Upvotes: 1

myfashionhub

Reputation: 435

Be more specific with your selector. Note that the result divs have the class "_NId", so choose the first link inside each of those divs.

result_divs = soup.findAll('div', {'class': '_NId'})[:5]  # keep only the first five result divs
links = [div.find('a') for div in result_divs]
hrefs = [link.get('href') for link in links]
print('\n'.join(hrefs))

Upvotes: 1
