Reputation: 25
I'm a beginner at web scraping. Recently I tried scraping domain names from Google search results (SERP).
To do this I used Requests to fetch the page, Beautiful Soup to parse the tags and read each href, and a regex match to extract the domain name.
Some links are missing from the output. The problem seems to be that Requests does not fetch the complete page: when I compared the fetched text with the page source in Chrome, the missing tags were in the part of the source that Requests did not return. What could be the reason?
import requests
from bs4 import BeautifulSoup
import re

url = "https://www.google.com/search?q=glass+beads+india"
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page, 'lxml')

i = 0
link_list = []
for tag in soup.find_all('a'):
    i += 1
    href = tag['href']
    if re.search('http', href):
        try:
            link = re.search(r'https://.+\.com', href).group(0)
            link_list.append(link)
        except:
            pass

link_list = list(set(link_list))
link_list2 = []
for link in link_list:
    if not re.search('google.com', link):
        link_list2.append(link)
print(link_list2)
Upvotes: 1
Views: 1786
Reputation: 1724
It could be because you didn't specify a user-agent in the request headers. Without one, Requests sends its default python-requests user-agent, so Google may block the request and return a page with an error message or something similar. Check what your user-agent is.
Pass a user-agent:
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('YOUR URL', headers=headers)
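To tell whether a response is a block page rather than real results, a rough check on the status code and body can help. The sketch below is only a heuristic: the helper name `looks_blocked`, the status codes, and the marker strings are my assumptions, not an official list from Google.

```python
def looks_blocked(status_code: int, body: str) -> bool:
    """Heuristic guess at whether a response is a block page (hypothetical helper)."""
    # Assumption: 403 and 429 are treated as refusals.
    if status_code in (403, 429):
        return True
    # Assumption: these marker strings are illustrative, not an official list.
    markers = ("unusual traffic", "captcha")
    lowered = body.lower()
    return any(marker in lowered for marker in markers)


# With a real response this would be called as:
# looks_blocked(html.status_code, html.text)
print(looks_blocked(200, "<html><a href='https://www.etsy.com/'>Etsy</a></html>"))
print(looks_blocked(429, ""))
print(looks_blocked(200, "Our systems have detected unusual traffic"))
```

If the check returns True, the missing links are more likely caused by the response itself than by the parsing code.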
Find all links using the SelectorGadget Chrome extension to grab CSS selectors (CSS selectors reference):
# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
Match the domain and subdomain, excluding the "www." part (re.findall returns a list of matches):
>>> re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link)
['etsy.com']
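As an alternative to the regex, the standard library's `urllib.parse.urlparse` can pull out the host. This is a minimal sketch, not part of the original answer; `extract_domain` is a hypothetical helper name, and stripping only a leading "www." is an assumption that mirrors the regex above.

```python
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    """Return the host of `url` without a leading 'www.' (hypothetical helper)."""
    # urlparse only fills in the host when the URL has a scheme or starts with '//'.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


print(extract_domain("https://www.etsy.com/market/india_glass_beads"))  # etsy.com
```

Unlike the original `https://.+\.com` pattern from the question, this does not depend on the domain ending in ".com".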
Full code example:
import re  # needed for the domain regex below

import requests
from bs4 import BeautifulSoup  # the 'lxml' parser must be installed separately

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    'q': 'glass beads india',  # search query
    'hl': 'en',                # language
    'num': '100'               # number of results
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text

    # https://stackoverflow.com/a/25703406/15164646
    domain_name = ''.join(re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')

'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''
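Passing the search options as a dictionary means the query string does not have to be written by hand. The sketch below uses the standard library's `urllib.parse.urlencode` to show what such a dictionary looks like once encoded. It only illustrates the encoding and sends no request.

```python
from urllib.parse import urlencode

params = {
    'q': 'glass beads india',  # search query
    'hl': 'en',                # language
    'num': '100',              # number of results
}

# Spaces in the query become '+', matching the URL used in the question.
url = 'https://www.google.com/search?' + urlencode(params)
print(url)
# https://www.google.com/search?q=glass+beads+india&hl=en&num=100
```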
Alternatively, you can achieve the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The main difference is that you only need to iterate and extract data from structured JSON.
Code to integrate:
import os
import re  # needed for the domain regex below

from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),  # environment variable
    "engine": "google",
    "q": "glass beads india",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    displayed_link = result['displayed_link']
    domain_name = ''.join(re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')

'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''
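Since the question asks for a list of domains, the structured results can be reduced to unique domains in order of first appearance. This is a sketch under assumptions: the `results` dictionary here is a hand-written sample that only mimics the `organic_results` / `link` keys read by the loop above (using the links from the printed output), and `unique_domains` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Hand-written sample, not a real API response.
results = {
    "organic_results": [
        {"link": "https://www.etsy.com/market/india_glass_beads"},
        {"link": "https://www.etsy.com/market/indian_glass_beads"},
        {"link": "https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads"},
    ]
}


def unique_domains(results: dict) -> list:
    """Collect each domain once, keeping the order of first appearance."""
    seen = []
    # .get() avoids a KeyError when a response has no organic results.
    for result in results.get("organic_results", []):
        host = urlparse(result.get("link", "")).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        if host and host not in seen:
            seen.append(host)
    return seen


print(unique_domains(results))  # ['etsy.com', 'amazon.com']
```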
Disclaimer: I work for SerpApi.
Upvotes: 1