Reputation: 121
I want to access the top five (or any specified number) of result links from a Google search. Through research, I found and modified the following code.
import requests
from bs4 import BeautifulSoup
import re
search = raw_input("Search:")
page = requests.get("https://www.google.com/search?q=" + search)
soup = BeautifulSoup(page.content, "lxml")
links = soup.find("a")
print links.get('href')
This returns the first link on the page, which seems to be the Google Images tab every time.
This is not quite what I want. For starters, I don't want links to any Google sites, just the results. Also, I want the first three, five, or any specified number of results.
How can I use Python to do this?
Thanks ahead of time!
Upvotes: 6
Views: 5828
Reputation: 1724
If you're using find_all(), then you can simply use the limit argument instead of list slicing, as myfashionhub suggested (but if you're using select(), then slicing is needed):
soup.findAll('div', {'class': '_NId'})[:5]
is equivalent to:
soup.findAll('div', {'class': '_NId'}, limit=5)
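For example, extracting the first five result links with limit might look like this (a minimal sketch; the '_NId' container class comes from myfashionhub's answer below and has likely changed since):
result_divs = soup.find_all('div', {'class': '_NId'}, limit=5)  # at most five result containers
hrefs = [div.find('a')['href'] for div in result_divs]          # first link inside each container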
As mentioned in other answers, you're grabbing <a> tags from the whole HTML. You're looking for this instead (CSS selectors reference; the SelectorGadget extension helps grab CSS selectors):
links = soup.find("a")  # returns only the first <a> tag from the HTML

# select the container with the needed elements and grab each element in a loop
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
Make sure you're passing a user-agent, because the default requests user-agent is python-requests. Google blocks such requests because it knows it's a bot rather than a "real" user visit, and you'll receive a different HTML with some sort of an error. A user-agent fakes a user visit by adding this information to the HTTP request headers.
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines; it covers multiple solutions.
Pass the user-agent in request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samurai cop what does katana mean",  # search query
    "gl": "us",   # country to search from
    "hl": "en",   # interface language
    "num": "100"  # number of results per page
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc')[:5]:
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(title, link, sep='\n')
--------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with picking the correct selectors or figuring out why certain things don't work as expected. Instead, you only need to iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
    print(result['title'])
    print(result['link'])
---------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
"It means "Japanese sword"... 2 minute review of a ... - Reddit
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
What does Katana mean? - Samurai Cop quotes - Subzin.com
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Disclaimer: I work for SerpApi.
Upvotes: 0
Reputation: 99001
You can use:
import requests
from bs4 import BeautifulSoup

search = input("Search:")
results = 100  # valid options: 10, 20, 30, 40, 50, and 100
page = requests.get(f"https://www.google.com/search?q={search}&num={results}")
soup = BeautifulSoup(page.content, "html5lib")
links = soup.findAll("a")

for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and "webcache" not in link_href:
        print(link_href.split("?q=")[1].split("&sa=U")[0])
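Since the question asks for only the top five links, you could cap the loop with a simple counter. A minimal sketch built on the loop above:
count = 0
for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and "webcache" not in link_href:
        print(link_href.split("?q=")[1].split("&sa=U")[0])
        count += 1
        if count == 5:  # stop after the first five results
            break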
For duckduckgo.com, use:
import requests
from bs4 import BeautifulSoup

search = input("Search:")
h = {"Host": "duckduckgo.com", "Origin": "https://duckduckgo.com", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0"}
d = {"q": search}
page = requests.post("https://duckduckgo.com/html/", data=d, headers=h)
soup = BeautifulSoup(page.content, "html5lib")
links = soup.findAll("a", {"class": "result__a"})

for link in links:
    link_href = link.get('href')
    if "https://duckduckgo.com" not in link_href:
        print(link_href)
Upvotes: 9
Reputation: 3784
You could try the code below:
import bs4, requests

headers = {'User-Agent':
           'MAKE A GOOGLE SEARCH FOR MY USER AGENT AND PASTE IT HERE'}
search = "test"
address = 'http://www.google.com/search?q=' + search
res = requests.get(address, headers=headers)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
links = soup.select('div.r a')

l = []  # empty list to hold only the top 5 links

# clean the soup by filtering only the information requested
for link in links:
    if "webcache.googleusercontent.com" in link.attrs["href"]:
        pass
    elif "#" in link.attrs["href"]:
        pass
    elif "/search?q=related:" in link.attrs["href"]:
        pass
    else:
        l.append(link.attrs["href"])

for href in l[:5]:  # slicing avoids an IndexError if fewer than 5 links were found
    print(href)
Make sure to replace your User-Agent information as suggested.
Upvotes: 0
Reputation: 11
An old question, but it may help someone later... you can page through results with 'start' (a multiple of 10, one results page each) and put it in a loop. Below is an example to get the first 200 results. Mind the string conversion.
s = 'AAPL'
for mypage in range(0, 200, 10):
    myurl = "http://www.google.com/search?q=" + s + "&start=" + str(mypage)
Bonus: notice you can also specify the language with 'hl': en (English), fr (French), etc.
myurl = "http://www.google.com/search?hl=fr&q=" + s + "&start=" + str(mypage)
Upvotes: 1
Reputation: 435
Be more specific with your selector. Note that the result divs have the class "_NId", so choose the first link inside that div.
result_divs = soup.findAll('div', {'class': '_NId'})[:4]
links = [div.find('a') for div in result_divs]
hrefs = [link.get('href') for link in links]
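To display them, a short usage sketch:
for href in hrefs:
    print(href)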
Upvotes: 1