Reputation: 27
I want to search Google using BeautifulSoup and open the first result. But when I open the link, it shows an error. I think the reason is that Google does not return the exact URL of the website; it appends several tracking parameters to it. How do I get the exact URL?
When I tried using the cite tag it worked, but for long URLs it causes problems.
The first link which i get using soup.h3.a['href'][7:] is: 'http://www.wikipedia.com/wiki/White_holes&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ'
Here is my code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.google.com/search?q=site:wikipedia.com+Black+hole&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw')
soup = BeautifulSoup(r.text, "html.parser")
print(soup.h3.a['href'][7:])
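For context, the [7:] slice exists because with gbv=1 Google wraps each result link in a redirect of the form /url?q=<target>&sa=…; a minimal illustration using the href from the question:

```python
# The anchor's href as Google serves it with gbv=1: a '/url?q=' redirect
# followed by tracking parameters (values taken from the question above).
href = ('/url?q=http://www.wikipedia.com/wiki/White_holes'
        '&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI'
        '&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ')

# The slice drops the leading '/url?q=' (7 characters), leaving the
# target URL still glued to Google's tracking parameters.
print(href[7:])
```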
Upvotes: 1
Views: 929
Reputation: 1724
It's much simpler. You're looking for this:
# instead of this:
soup.h3.a['href'][7:].split('&')
# use this:
soup.select_one('.yuRUbf a')['href']
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "site:wikipedia.com black hole",  # query
    "gl": "us",                            # country to search from
    "hl": "en"                             # language
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
first_link = soup.select_one('.yuRUbf a')['href']
print(first_link)
# https://en.wikipedia.com/wiki/Primordial_black_hole
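Note that .yuRUbf is a class name generated by Google and liable to change; a defensive variant (a sketch against a hypothetical HTML snippet, not the live page) avoids a TypeError when the selector no longer matches:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for one Google result block.
html = '<div class="yuRUbf"><a href="https://en.wikipedia.org/wiki/Black_hole">Black hole</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# select_one returns None when nothing matches, e.g. after Google
# renames the class, so check before indexing into it.
node = soup.select_one('.yuRUbf a')
first_link = node['href'] if node is not None else None
print(first_link)
```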
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only extract the data you need from structured JSON, rather than figuring out why things don't work and then maintaining the scraper over time as selectors change.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "site:wikipedia.com black hole",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] - first index of search results
first_link = results['organic_results'][0]['link']
print(first_link)
# https://en.wikipedia.com/wiki/Primordial_black_hole
Disclaimer, I work for SerpApi.
Upvotes: 0
Reputation: 157
Combining the answers above, your code could look like this:
from bs4 import BeautifulSoup
import requests
import csv

url = "https://www.google.co.in/search?q=site:wikipedia.com+Black+hole&dcr=0&gbv=2&sei=Nr3rWfLXMIuGvQT9xZOgCA"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

get_details = soup.find_all("div", attrs={"class": "g"})
final_data = []
for details in get_details:
    link = details.find_all("h3")
    for mdetails in link:
        links = mdetails.find_all("a")
        for lnk in links:
            # drop the '/url?q=' prefix, then cut off Google's tracking parameters
            lmk = lnk.get("href")[7:].split("&")
            final_data.append([lmk[0]])

filename = "Google.csv"
# newline="" prevents csv from inserting blank rows on Windows
with open("./" + filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter=",")
    for row in final_data:
        writer.writerow(row)
Upvotes: 0
Reputation: 47169
You could split the returned string:
url = soup.h3.a['href'][7:].split('&')
print(url[0])
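Since the target actually lives in the q parameter of Google's /url redirect, a more robust variant is to parse it with the standard library instead of slicing and splitting (a sketch using the href from the question):

```python
from urllib.parse import urlsplit, parse_qs

# Full href as served by Google, taken from the question above.
href = ('/url?q=http://www.wikipedia.com/wiki/White_holes'
        '&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI'
        '&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ')

# parse_qs maps each query parameter name to a list of its values,
# so ['q'][0] is the clean target URL.
url = parse_qs(urlsplit(href).query)['q'][0]
print(url)  # http://www.wikipedia.com/wiki/White_holes
```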
Upvotes: 1