Unable to scrape all the links from Google Search Page by Web Scraping

I'm a beginner in web scraping. Recently I tried scraping domains from Google search results (SERP).

To accomplish this I used Requests, Beautiful Soup, and regex: Requests fetches the page, Beautiful Soup parses the tags, and a regex match on each href extracts the domain name.

Some links are missing from the output. The problem seems to be that Requests is not fetching the complete page: when I compared the fetched text with the page source in Chrome, the missing tags were in the part of the source that Requests did not return. What could the reason be?

import re

import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/search?q=glass+beads+india"
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page, 'lxml')

link_list = []
# href=True skips <a> tags without an href, which would raise KeyError on tag['href']
for tag in soup.find_all('a', href=True):
    href = tag['href']
    # only matches https URLs that contain ".com"
    match = re.search(r'https://.+\.com', href)
    if match:
        link_list.append(match.group(0))

# drop duplicates
link_list = list(set(link_list))

# drop Google's own links
link_list2 = [link for link in link_list if not re.search(r'google\.com', link)]

print(link_list2)

Upvotes: 1

Answers (1)

Dmitriy Zub

Reputation: 1724

It could be because you didn't specify a user-agent in the request headers. Without one, Google may block the request, so you receive a different page with an error message or something similar. Check what your user-agent is.
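To see what the script identifies itself as, you can ask requests for its default user-agent without sending anything. This is a minimal sketch, assuming a reasonably recent requests version; the `browser_ua` string is only a shortened placeholder, not a recommended value:

```python
import requests

# The user-agent requests sends when no headers are passed.
# It has the form "python-requests/<version>".
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A Session starts with that default; overriding it once applies
# to every request made through the session.
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # placeholder value
session = requests.Session()
session.headers.update({"User-Agent": browser_ua})
print(session.headers["User-Agent"])
```

A value starting with `python-requests/` makes it easy for a server to tell the request came from a script rather than a browser.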

Pass a user-agent:

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('YOUR URL', headers=headers)

Find all links using CSS selectors; the SelectorGadget Chrome extension can help you pick them (CSS selectors reference):

# container with all needed data
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  displayed_link = result.select_one('.TbwUpd.NJjxre').text

Match domain and subdomain excluding "www." part:

>>> re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link)
['etsy.com']
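If the regex feels hard to read, the standard library can do the same job for full URLs. This is an alternative sketch, not the approach used in this answer; `extract_domain` is a hypothetical helper name:

```python
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the host of `url` without a leading "www." (hypothetical helper)."""
    # hostname is None when the string has no "scheme://host" part
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(extract_domain("https://www.etsy.com/market/india_glass_beads"))
```

Unlike the regex, this returns an empty string for input without a scheme (for example `www.etsy.com/market`), which is fine here because the hrefs selected from the results are full URLs.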

Code and full example in the online IDE:

import re

import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below must be installed

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  'q': 'glass beads india',  # search query
  'hl': 'en',                # language
  'num': '100'               # number of results
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  displayed_link = result.select_one('.TbwUpd.NJjxre').text

  # https://stackoverflow.com/a/25703406/15164646
  domain_name = ''.join(re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link))

  print(link)
  print(displayed_link)
  print(domain_name)
  print('---------------')


'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''
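The question asked for a de-duplicated list of domains with Google's own links removed, whereas the loop above prints every result. One possible sketch of that last step follows; `unique_domains` is a hypothetical helper, and the sample list reuses domains from the output above plus one Google host added purely for illustration:

```python
def unique_domains(domains, exclude=("google.com",)):
    """Drop duplicates (keeping first-seen order) and excluded domains."""
    seen = {}
    for domain in domains:
        # skip the excluded domain itself and any of its subdomains
        if any(domain == ex or domain.endswith("." + ex) for ex in exclude):
            continue
        seen.setdefault(domain, None)  # dicts keep insertion order
    return list(seen)


sample = ["etsy.com", "etsy.com", "amazon.com", "support.google.com"]
print(unique_domains(sample))
```

In the full example you would append each `domain_name` to a list inside the loop and pass that list to the helper afterwards. A dict is used instead of `set()` so the domains stay in the order Google returned them.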

Alternatively, you can achieve the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The main difference is that you only need to iterate and extract data from structured JSON.

Code to integrate:

import os
import re

from serpapi import GoogleSearch

params = {
  "api_key": os.getenv("API_KEY"), # environment variable
  "engine": "google",
  "q": "glass beads india",
  "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    displayed_link = result['displayed_link']
    domain_name = ''.join(re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link))
    
    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')


'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''

Disclaimer: I work for SerpApi.

Upvotes: 1
