Asker

Reputation: 1685

Beautiful Soup CSS selector not finding anything

I'm using Python 3. The code below is supposed to let the user enter a search term into the command line, after which it searches Google and runs through the HTML of the results page to find tags matching the CSS selector ('.r a').

Say we search for the term "cats." I know the tags I'm looking for exist on the "cats" search results page since I looked through the page source myself.

But when I run my code, the linkElems list is empty. What is going wrong?

    import requests, sys, bs4

    print('Googling...')
    res = requests.get('http://google.com/search?q='  +' '.join(sys.argv[1:]))
    print(res.raise_for_status())

    soup = bs4.BeautifulSoup(res.text, 'html5lib')
    linkElems = soup.select(".r a")
    print(linkElems)

Upvotes: 0

Views: 677

Answers (2)

Dmitriy Zub

Reputation: 1724

Contrary to what Matts mentioned, the parts you want to extract are not rendered by JavaScript, and you don't need a regex for this task.

Make sure you're sending a user-agent header, otherwise Google will eventually block your requests. That is likely why you were getting empty output: without one, Google serves completely different HTML. Check what your user-agent is. I've previously answered a question about what a user-agent and HTTP headers are.
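If you're not sure what requests sends by default, you can check it directly (a quick sketch; the exact version string depends on your installed requests release):

```python
import requests

# By default, requests identifies itself as "python-requests/<version>",
# which Google can easily detect and respond to with different HTML.
print(requests.utils.default_user_agent())
```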

Pass user-agent into HTTP headers:

    headers = {
        'User-agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    requests.get("YOUR_URL", headers=headers)

html5lib is the slowest parser; try lxml instead, it's much faster. If you want an even faster parser, have a look at selectolax.

Code and full example:

    from bs4 import BeautifulSoup
    import requests

    headers = {
        'User-agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }

    params = {
        "q": "selena gomez"
    }

    html = requests.get('https://www.google.com/search', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select('.tF2Cxc'):
        link = result.select_one('.yuRUbf a')['href']
        print(link)

Output:

    https://www.instagram.com/selenagomez/
    https://www.selenagomez.com/
    https://en.wikipedia.org/wiki/Selena_Gomez
    https://www.imdb.com/name/nm1411125/
    https://www.facebook.com/Selena/
    https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
    https://www.vogue.com/article/selena-gomez-cover-april-2021
    https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx

Alternatively, you can achieve the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with the parsing part; instead, you only iterate over structured JSON to get the data you want, and you don't have to maintain the parser over time.

Code to integrate:

    import os
    from serpapi import GoogleSearch

    params = {
        "engine": "google",
        "q": "selena gomez",
        "api_key": os.getenv("API_KEY"),
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results["organic_results"]:
        link = result['link']
        print(link)

Output:

    https://www.instagram.com/selenagomez/
    https://www.selenagomez.com/
    https://en.wikipedia.org/wiki/Selena_Gomez
    https://www.imdb.com/name/nm1411125/
    https://www.facebook.com/Selena/
    https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
    https://www.vogue.com/article/selena-gomez-cover-april-2021
    https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx

P.S. I wrote a blog post about how to scrape Google organic search results.

Disclaimer: I work for SerpApi.

Upvotes: 0

Matts

Reputation: 1341

The ".r" class is rendered by JavaScript, so it's not present in the HTML you receive. You can either render the JavaScript using Selenium or similar, or try a more creative way of extracting the links. First check that the tags exist at all by finding them without the ".r" class:

    soup.find_all("a")

Then, as an example, you can use a regex to extract all URLs beginning with "/url?q=":

    import re
    linkelems = soup.find_all(href=re.compile(r"^/url\?q=.*"))
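If you go this route, note that each matched href is a Google redirect, not the target page itself. A minimal sketch of pulling the real URL out of the `q` query parameter with the standard library (the example href is made up for illustration):

```python
from urllib.parse import urlparse, parse_qs

# A hypothetical href in the shape of Google's redirect links
href = "/url?q=https://en.wikipedia.org/wiki/Cat&sa=U&ved=abc123"

# The real destination is the value of the "q" query parameter
target = parse_qs(urlparse(href).query)["q"][0]
print(target)  # https://en.wikipedia.org/wiki/Cat
```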

Upvotes: 1
