Reputation: 27
I want to search Google using BeautifulSoup and open the first result. But when I open the link, it shows an error. I think the reason is that Google does not return the exact URL of the website; it appends several tracking parameters to it. How do I get the exact URL?
When I tried using the cite tag it worked, but for long URLs it causes problems.
The first link which i get using soup.h3.a['href'][7:] is: 'http://www.wikipedia.com/wiki/White_holes&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ'
Here is my code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.google.com/search?q=site:wikipedia.com+Black+hole&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw')
soup = BeautifulSoup(r.text, "html.parser")
print(soup.h3.a['href'][7:])
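For context, the [7:] slice exists because with gbv=1 Google wraps each result link in a redirect of the form /url?q=<target>&sa=…; a minimal illustration using the href from the question:

```python
# The anchor's href as Google serves it with gbv=1: a '/url?q=' redirect
# followed by tracking parameters (values taken from the question above).
href = ('/url?q=http://www.wikipedia.com/wiki/White_holes'
        '&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI'
        '&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ')

# The slice drops the leading '/url?q=' (7 characters), leaving the
# target URL still glued to Google's tracking parameters.
print(href[7:])
```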
Upvotes: 1
Views: 929
Reputation: 1724
It's much simpler. You're looking for this:
# instead of this:
soup.h3.a['href'][7:].split('&')
# use this:
soup.select_one('.yuRUbf a')['href']
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "site:wikipedia.com black hole",  # query
    "gl": "us",                            # country to search from
    "hl": "en"                             # language
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
first_link = soup.select_one('.yuRUbf a')['href']
print(first_link)
# https://en.wikipedia.com/wiki/Primordial_black_hole
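Note that .yuRUbf is a class name generated by Google and liable to change; a defensive variant (a sketch against a hypothetical HTML snippet, not the live page) avoids a TypeError when the selector no longer matches:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for one Google result block.
html = '<div class="yuRUbf"><a href="https://en.wikipedia.org/wiki/Black_hole">Black hole</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# select_one returns None when nothing matches, e.g. after Google
# renames the class, so check before indexing into it.
node = soup.select_one('.yuRUbf a')
first_link = node['href'] if node is not None else None
print(first_link)
```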
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only extract the data you need from structured JSON, rather than figuring out why things don't work and then maintaining the scraper over time as selectors change.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "site:wikipedia.com black hole",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] - first index of search results
first_link = results['organic_results'][0]['link']
print(first_link)
# https://en.wikipedia.com/wiki/Primordial_black_hole
Disclaimer, I work for SerpApi.
Upvotes: 0
Reputation: 157
Combining the answers above, your code could look like this:
from bs4 import BeautifulSoup
import requests
import csv

url = "https://www.google.co.in/search?q=site:wikipedia.com+Black+hole&dcr=0&gbv=2&sei=Nr3rWfLXMIuGvQT9xZOgCA"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

get_details = soup.find_all("div", attrs={"class": "g"})
final_data = []
for details in get_details:
    link = details.find_all("h3")
    for mdetails in link:
        links = mdetails.find_all("a")
        for lnk in links:
            # drop the '/url?q=' prefix, then cut off Google's tracking parameters
            lmk = lnk.get("href")[7:].split("&")
            final_data.append([lmk[0]])

filename = "Google.csv"
# newline="" prevents csv from inserting blank rows on Windows
with open("./" + filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter=",")
    for row in final_data:
        writer.writerow(row)
Upvotes: 0
Reputation: 47169
You could split the returned string:
url = soup.h3.a['href'][7:].split('&')
print(url[0])
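Since the target actually lives in the q parameter of Google's /url redirect, a more robust variant is to parse it with the standard library instead of slicing and splitting (a sketch using the href from the question):

```python
from urllib.parse import urlsplit, parse_qs

# Full href as served by Google, taken from the question above.
href = ('/url?q=http://www.wikipedia.com/wiki/White_holes'
        '&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI'
        '&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ')

# parse_qs maps each query parameter name to a list of its values,
# so ['q'][0] is the clean target URL.
url = parse_qs(urlsplit(href).query)['q'][0]
print(url)  # http://www.wikipedia.com/wiki/White_holes
```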
Upvotes: 1