Reputation: 2189
I am scraping Google search results, but I repeatedly get a SyntaxError while doing it. Here's the code:
import urllib.request
from bs4 import BeautifulSoup
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/70.0'
url = "https://www.google.com/search?hl=en&q=python+wikipedia"
headers={'User-Agent':user_agent,}
request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
data = response.read()
soup= BeautifulSoup(data, 'html.parser')
l = soup.find_all('h' , 'attrs' = {"class":'LC20lb'})
print(l)
I get:
SyntaxError: keyword can't be an expression
on the line l = soup.find_all('h' , 'attrs' = {"class":'LC20lb'}). Can someone please tell me what I'm doing wrong?
Upvotes: 1
Views: 62
Reputation: 1724
Try using requests instead.
Also try CSS selectors, e.g. select()/select_one(); they're more flexible, a bit more readable, and a bit faster.
soup.select('.LC20lb') # equivalent to find_all()
Check out the SelectorGadget Chrome extension to grab CSS
selectors by clicking on the desired element in the browser.
Also, you don't have to spell out the attrs dict in find_all(); the class can be passed as the second positional argument, e.g.:
soup.find_all('h3', 'LC20lb') # returns a list of title tags
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "python wikipedia"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# container with all titles
for result in soup.select('.tF2Cxc'):
    # extract each title from the container via its CSS selector
    title = result.select_one('.DKV0Md').text
    print(title)
-----
'''
Python (programming language) - Wikipedia
Python - Wikipedia
History of Python - Wikipedia
wikipedia 1.4.0 - PyPI
What is Python? Executive Summary
Python Wiki: FrontPage
BeginnersGuide/Programmers - Python Wiki
Wikipedia API for Python. In this tutorial let us understand the…
Wikipedia — wikipedia 0.9 documentation
'''
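As an aside, the params dict that requests accepts is just shorthand for URL-encoding the query string yourself. A minimal stdlib sketch (assuming only the q and hl parameters used above):

```python
from urllib.parse import urlencode

# requests.get(url, params=...) appends an encoded query string like this one
query = urlencode({"q": "python wikipedia", "hl": "en"})
url = "https://www.google.com/search?" + query
print(url)  # https://www.google.com/search?q=python+wikipedia&hl=en
```

Letting requests (or urlencode) build the string also handles escaping of spaces and special characters for you, instead of hardcoding them into the URL as in the question.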
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and get what you want rather than figuring out how to parse stuff.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "python Wikipedia",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    title = result['title']
    print(title)
------
'''
Python - Wikipedia
History of Python - Wikipedia
wikipedia 1.4.0 - PyPI
What is Python? Executive Summary
Python Wiki: FrontPage
BeginnersGuide/Programmers - Python Wiki
Wikipedia API for Python. In this tutorial let us understand the…
Wikipedia — wikipedia 0.9 documentation
'''
Disclaimer: I work for SerpApi.
Upvotes: 0
Reputation: 1029
import urllib.request
from bs4 import BeautifulSoup
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/70.0'
url = "https://www.google.com/search?hl=en&q=python+wikipedia"
headers={'User-Agent':user_agent,}
request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
data = response.read()
soup= BeautifulSoup(data, 'html.parser')
l = soup.find_all('h3', {"class":'LC20lb'}) # pass the attrs dict positionally; note the titles are <h3> tags, not <h>
print(l)
Upvotes: 1
Reputation: 2433
There should be no quotes around attrs:
l = soup.find_all('h' , attrs = {"class":'LC20lb'})
# not:
#l = soup.find_all('h' , 'attrs' = {"class":'LC20lb'})
#                        ^     ^
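The underlying rule is that a keyword argument name must be a bare identifier, not an expression such as a string literal, so Python rejects the quoted form at compile time before any code runs. A minimal sketch using a hypothetical function f (the exact error message wording varies between Python versions):

```python
# keyword names must be identifiers: f(attrs=...) is fine, f('attrs'=...) is not
good = "f(attrs={'class': 'LC20lb'})"
bad = "f('attrs'={'class': 'LC20lb'})"

compile(good, "<example>", "eval")  # compiles without error

try:
    compile(bad, "<example>", "eval")
except SyntaxError as e:
    print(e.msg)  # "keyword can't be an expression" on older Pythons
```

This is why the fix is purely syntactic: removing the quotes turns the string literal back into the identifier attrs.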
Upvotes: 1