Reputation: 4442
I need to parse links with results after search in Google.
When I try to see code of page and Ctrl + U
I can't find element with links, what I want.
But When I see code of elements with
Ctrl + Shift + I
I can see what elem should I parse to get links.
I use code
url = 'https://www.google.ru/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=' + str(query)
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')
But it returns empty list, becauses there are not this elements.
I think that html-code
, that returns requests.get(url).content
isn't full, so I can't get this elements.
I tried to use google.search
but it returned error that it isn't used now.
Is any way to get links with search in google?
Upvotes: 0
Views: 1396
Reputation: 1724
In order to get the actual response that you see in the browser, you need to send additional headers, more specifically user-agent
(aside from sending additional query parameters) which is needed to act as a "real" user visit when the bot or browser sends a fake user-agent
string to announce themselves as a different client.
That's why you were getting an empty output because you received a different HTML with different elements (CSS
selectors, ID
's, and so on).
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent
:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
'q': 'minecraft', # query
'gl': 'us', # country to search from
'hl': 'en', # language
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
print(link, sep='\n')
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it over time if something crashes.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "minecraft",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['link'])
-------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Disclaimer, I work for SerpApi.
Upvotes: 0
Reputation: 979
Try:
url = 'https://www.google.ru/search?q=' + str(query)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
links = soup.findAll('cite')
print([link.text for link in links])
For installing lxml
, please see http://lxml.de/installation.html
*note: The reason I choose lxml
instead html.parser
is that sometimes I got incomplete result with html.parser and I don't know why
Upvotes: 1
Reputation: 2930
USe:
url = 'https://www.google.ru/search?q=name&rct=' + str(query)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')
Upvotes: 1