user353829
user353829

Reputation: 1434

Extract Google Search Results

I would like to periodically check what sub-domains are being listed by Google.

To obtain list of sub-domains, I type 'site:example.com' in Google search box - this lists all the sub-domain results (over 20 pages for our domain).

What is the best way to extract only the URL of the addresses returned by the 'site:example.com' search?

I was thinking of writing a little python script that will do the above search and regex the URLs from the search results (repeat on all result pages). Is this a good start? Could there be a better methodology?

Cheers.

Upvotes: 4

Views: 13692

Answers (3)

Dmitriy Zub
Dmitriy Zub

Reputation: 1724

Another way of doing it using requests, bs4:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'site:minecraft.fandom.com'}

html = requests.get(f'https://www.google.com/search?q=',
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

for container in soup.findAll('div', class_='tF2Cxc'):
   link = container.find('a')['href']
   print(link)

Output:

https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff

To get these results from each page using pagination:

from bs4 import BeautifulSoup
import requests, urllib.parse
import lxml

def print_extracted_data_from_url(url):

    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text

    soup = BeautifulSoup(response, 'lxml')

    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')
    print()

    for container in soup.findAll('div', class_='tF2Cxc'):
        head_link = container.a['href']
        print(head_link)

    return soup.select_one('a#pnnext')


def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com')

    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com', next_page_node['href'])

        next_page_node = print_extracted_data_from_url(next_page_url)

scrape()

Part of the output:

Results via beautifulsoup

Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com

https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff

Alternatively, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "engine": "google",
  "q": "site:minecraft.fandom.com",
  "api_key": os.getenv('API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  link = result['link']
  print(link)

Output:

https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff

Using pagination:

import os
from serpapi import GoogleSearch

def scrape():
  
  params = {
    "engine": "google",
    "q": "site:minecraft.fandom.com",
    "api_key": os.getenv("API_KEY"),
  }

  search = GoogleSearch(params)
  results = search.get_dict()

  print(f"Current page: {results['serpapi_pagination']['current']}")

  for result in results["organic_results"]:
      print(f"Title: {result['title']}\nLink: {result['link']}\n")

  while 'next' in results['serpapi_pagination']:
      search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
      results = search.get_dict()

      print(f"Current page: {results['serpapi_pagination']['current']}")

      for result in results["organic_results"]:
          print(f"Title: {result['title']}\nLink: {result['link']}\n")
scrape()

Disclaimer, I work for SerpApi.

Upvotes: 0

danneu
danneu

Reputation: 9444

Regex is a bad idea for parsing HTML. It's cryptic to read and relies of well-formed HTML.

Try BeautifulSoup for Python. Here's an example script that returns URLs from the first 10 pages of a site:domain.com Google query.

import sys # Used to add the BeautifulSoup folder the import path
import urllib2 # Used to read the html document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, I have the BeautifulSoup folder in the level of this Python script
    ### So I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### the "start" variable will be used to iterate through 10 pages.
    for start in range(0,10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start*10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Looks like google contains URLs in <cite> tags.
        ### So for each cite tag on each page (10), print its contents (url)
        for cite in soup.findAll('cite'):
            print cite.text

Output:

stackoverflow.com/
stackoverflow.com/questions
stackoverflow.com/unanswered
stackoverflow.com/users
meta.stackoverflow.com/
blog.stackoverflow.com/
chat.meta.stackoverflow.com/
...

Of course, you could append each result to a list so you can parse it for subdomains. I just got into Python and scraping a few days ago, but this should get you started.

Upvotes: 16

CWMan
CWMan

Reputation: 390

The Google Custom Search API can deliver results in ATOM XML format

Getting Started with Google Custom Search

Upvotes: 3

Related Questions