Petr Petrov

Reputation: 4442

Python: parse links from Google with search

I need to parse the result links after a Google search. When I view the page source with Ctrl + U, I can't find the element with the links I want. But when I inspect the page with Ctrl + Shift + I, I can see which element I should parse to get the links. I use this code:

url = 'https://www.google.ru/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=' + str(query)
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')

But it returns an empty list, because those elements aren't there. I think the HTML that requests.get(url).content returns isn't complete, so I can't get those elements. I tried to use google.search, but it returned an error saying it is no longer supported. Is there any way to get the links from a Google search?

Upvotes: 0

Views: 1396

Answers (3)

Dmitriy Zub

Reputation: 1724

In order to get the same response that you see in the browser, you need to send additional headers, most importantly a user-agent (aside from the query parameters). By default, requests identifies itself as python-requests, and Google serves a different page to such clients; sending a browser-like user-agent string makes the request look like a "real" user visit.

That's why you were getting an empty output: you received different HTML with different elements (CSS selectors, IDs, and so on).

You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.

Pass user-agent:

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('URL', headers=headers)

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
  'q': 'minecraft', # query
  'gl': 'us',       # country to search from
  'hl': 'en',       # language
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)

---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''

Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to build the scraper from scratch and maintain it over time when something breaks.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "minecraft",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['link'])

-------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''

Disclaimer, I work for SerpApi.

Upvotes: 0

Hui-Yu Lee

Reputation: 979

Try:

url = 'https://www.google.ru/search?q=' + str(query)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
links = soup.findAll('cite')
print([link.text for link in links])

For installing lxml, see http://lxml.de/installation.html

*Note: The reason I chose lxml instead of html.parser is that I sometimes got incomplete results with html.parser, and I don't know why.
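One pitfall with concatenating str(query) directly into the URL, as the snippets above do, is that multi-word or special-character queries are not percent-encoded. A minimal sketch of safe URL building with the standard library (build_search_url is a hypothetical helper, not part of any answer's code):

```python
from urllib.parse import urlencode

def build_search_url(query, base='https://www.google.ru/search'):
    # urlencode percent-escapes spaces and special characters,
    # so multi-word queries survive the round trip intact
    return base + '?' + urlencode({'q': query})

print(build_search_url('python parse links'))
# https://www.google.ru/search?q=python+parse+links
```

Equivalently, passing params={'q': query} to requests.get handles the encoding for you, as the first answer's code does.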

Upvotes: 1

Mithilesh Gupta

Reputation: 2930

Use:

url = 'https://www.google.ru/search?q=name&rct=' + str(query)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')

Upvotes: 1
