Reputation: 1685
I'm using Python 3. The code below is supposed to let the user enter a search term into the command line, after which it searches Google and runs through the HTML of the results page to find tags matching the CSS selector ('.r a').
Say we search for the term "cats." I know the tags I'm looking for exist on the "cats" search results page since I looked through the page source myself.
But when I run my code, the linkElems list is empty. What is going wrong?
import requests, sys, bs4
print('Googling...')
res = requests.get('http://google.com/search?q=' +' '.join(sys.argv[1:]))
print(res.raise_for_status())
soup = bs4.BeautifulSoup(res.text, 'html5lib')
linkElems = soup.select(".r a")
print(linkElems)
Upvotes: 0
Views: 677
Reputation: 1724
The parts you want to extract are not rendered by JavaScript, contrary to what Matts mentioned, and you don't need regex for such a task.
Make sure you send a user-agent header, otherwise Google will eventually block your request. That might be why you were getting empty output: you received completely different HTML. Check what your user-agent is. I have already answered what a user-agent and HTTP headers are.
Pass the user-agent in the HTTP headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
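To see exactly what such a request carries before sending it, here is a minimal sketch that only builds the request object. It deliberately swaps in the standard library's urllib for requests so it runs without extra installs; the helper name build_search_request is hypothetical, and the user-agent string is the one from the snippet above. With requests, passing params= to requests.get() does the same query encoding for you.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Same user-agent string as in the answer; any current browser string would do.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)


def build_search_request(terms, user_agent=USER_AGENT):
    """Build (but do not send) a GET request for a Google search.

    Hypothetical helper for illustration only.
    """
    # urlencode percent-encodes the query and turns spaces into '+',
    # which plain string concatenation would not do.
    query = urlencode({"q": " ".join(terms)})
    return Request(
        "https://www.google.com/search?" + query,
        headers={"User-Agent": user_agent},
    )


request = build_search_request(["selena", "gomez"])
print(request.full_url)
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(request.get_header("User-agent"))
```

Nothing is sent over the network here; the point is only that the search term ends up URL-encoded and the header is attached to the request.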
html5lib is the slowest parser; try lxml instead, it's way faster. If you want an even faster parser, have a look at selectolax.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "selena gomez"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# Each result container matches .tF2Cxc; the link is the anchor inside .yuRUbf
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
Alternatively, you can achieve the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with the parsing part: you only iterate over structured JSON and pick the data you want, and you don't have to maintain the parser over time.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "selena gomez",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# Each organic result is a dict; the URL is under the "link" key
for result in results["organic_results"]:
    link = result['link']
    print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
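As a rough illustration of "iterating over structured JSON", the sketch below walks a hand-written dictionary shaped like the results used above. Only the "organic_results" and "link" keys come from the code in this answer; the other keys, the sample values, and the helper name collect_links are assumptions made for the example. It uses .get() so that a missing key is skipped instead of raising KeyError.

```python
# Hand-written sample; "position" and "title" are illustrative assumptions,
# only "organic_results" and "link" appear in the code above.
sample_results = {
    "organic_results": [
        {"position": 1, "title": "Example A", "link": "https://example.com/a"},
        {"position": 2, "title": "Example B"},  # entry without a "link" key
        {"position": 3, "title": "Example C", "link": "https://example.com/c"},
    ]
}


def collect_links(results):
    """Return every non-empty "link" value from the organic results."""
    links = []
    # Fall back to an empty list if the response has no organic results.
    for result in results.get("organic_results", []):
        link = result.get("link")
        if link:
            links.append(link)
    return links


for link in collect_links(sample_results):
    print(link)
```

The same function can be called on the dictionary returned by search.get_dict().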
P.S. I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer: I work for SerpApi.
Upvotes: 0
Reputation: 1341
The ".r" class is rendered by JavaScript, so it's not available in the HTML you receive. You can either render the JavaScript using Selenium or similar, or try a more creative way of extracting the links from the tags. First check that the tags exist by finding them without the ".r" class:
soup.find_all("a")
Then, as an example, you can use a regex to extract all URLs beginning with "/url?q=":
import re
# Raw string so the backslash reaches the regex engine unchanged
linkelems = soup.find_all(href=re.compile(r"^/url\?q=.*"))
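The matched hrefs are still redirect paths of the form "/url?q=...", so the real address has to be pulled out of the q parameter. Here is a minimal sketch of that step. To stay runnable without extra installs it uses the standard library's html.parser in place of BeautifulSoup, and the sample HTML, the class name HrefCollector, and the function name extract_result_urls are all made up for illustration; the markup Google actually returns may differ.

```python
import re
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit

# Same pattern idea as above: hrefs that start with "/url?q=".
REDIRECT_RE = re.compile(r"^/url\?q=")


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_result_urls(html):
    """Return the target URLs wrapped inside "/url?q=..." links."""
    collector = HrefCollector()
    collector.feed(html)
    urls = []
    for href in collector.hrefs:
        if REDIRECT_RE.match(href):
            # parse_qs decodes percent-escapes and returns a list per key.
            target = parse_qs(urlsplit(href).query).get("q")
            if target:
                urls.append(target[0])
    return urls


# Made-up sample markup, not a real Google response.
SAMPLE_HTML = """
<a href="/url?q=https://example.com/cats&amp;sa=U">Cats</a>
<a href="/search?q=cats&amp;tbm=isch">Images</a>
<a href="/url?q=https://example.org/kittens%3Fpage%3D2&amp;sa=U">Kittens</a>
"""

for url in extract_result_urls(SAMPLE_HTML):
    print(url)
```

With BeautifulSoup, the same unwrapping can be applied to elem["href"] for each element in linkelems.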
Upvotes: 1