Reputation: 1434
I would like to periodically check what sub-domains are being listed by Google.
To obtain a list of sub-domains, I type 'site:example.com' into the Google search box - this lists all the sub-domain results (over 20 pages for our domain).
What is the best way to extract only the URLs of the results returned by the 'site:example.com' search?
I was thinking of writing a little Python script that runs the above search and regexes the URLs out of the search results (repeating over all result pages). Is this a good start? Is there a better methodology?
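Something like this rough sketch is what I had in mind (untested; the regex and the paging step are placeholder guesses on my part):

import re
import urllib.request

# Rough idea: fetch each result page and regex out URLs pointing at our domain.
found = set()
for start in range(0, 200, 10):  # step through the paginated results
    req = urllib.request.Request(
        'https://www.google.com/search?q=site:example.com&start=%d' % start,
        headers={'User-agent': 'Mozilla/5.0'})
    html = urllib.request.urlopen(req).read().decode('utf-8', 'replace')
    found.update(re.findall(r'https?://[\w.-]*example\.com[^\s"<>]*', html))

print(found)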
Cheers.
Upvotes: 4
Views: 13692
Reputation: 1724
Another way of doing it is with requests and bs4:
import requests
from bs4 import BeautifulSoup  # the 'lxml' parser must also be installed

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

# 'q' is passed via params, so the base URL carries no query string.
params = {'q': 'site:minecraft.fandom.com'}

html = requests.get('https://www.google.com/search',
                    headers=headers,
                    params=params).text
soup = BeautifulSoup(html, 'lxml')

# Each organic result sits in a div with class 'tF2Cxc'.
for container in soup.find_all('div', class_='tF2Cxc'):
    link = container.find('a')['href']
    print(link)
Output:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
To get these results from each page using pagination:
from bs4 import BeautifulSoup
import requests, urllib.parse


def print_extracted_data_from_url(url):
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')
    print()

    for container in soup.find_all('div', class_='tF2Cxc'):
        head_link = container.a['href']
        print(head_link)

    return soup.select_one('a#pnnext')


def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com')

    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com',
                                             next_page_node['href'])
        next_page_node = print_extracted_data_from_url(next_page_url)


scrape()
Part of the output:
Results via beautifulsoup
Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=site:minecraft.fandom.com
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
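Since the question asks for sub-domains rather than full URLs, the scraped links can be reduced to unique hostnames with urllib.parse. A minimal sketch, where the links list is a hypothetical stand-in for the URLs the scraper above prints:

from urllib.parse import urlparse

# Hypothetical stand-in for the links collected by the scraper above.
links = [
    'https://minecraft.fandom.com/wiki/Podzol',
    'https://forum.minecraft.fandom.com/some/thread',
]

# netloc is the hostname part of the URL, i.e. the (sub-)domain.
subdomains = {urlparse(link).netloc for link in links}
print(subdomains)
# {'minecraft.fandom.com', 'forum.minecraft.fandom.com'}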
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "site:minecraft.fandom.com",
    "api_key": os.getenv('API_KEY')
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
Output:
https://minecraft.fandom.com/wiki/Podzol
https://minecraft.fandom.com/wiki/Pumpkin
https://minecraft.fandom.com/wiki/Swimming
https://minecraft.fandom.com/wiki/Polished_Blackstone
https://minecraft.fandom.com/wiki/Nether_Quartz_Ore
https://minecraft.fandom.com/wiki/Blacksmith
https://minecraft.fandom.com/wiki/Grindstone
https://minecraft.fandom.com/wiki/Spider
https://minecraft.fandom.com/wiki/Crash
https://minecraft.fandom.com/wiki/Tuff
Using pagination:
import os
from serpapi import GoogleSearch


def scrape():
    params = {
        "engine": "google",
        "q": "site:minecraft.fandom.com",
        "api_key": os.getenv("API_KEY"),
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    print(f"Current page: {results['serpapi_pagination']['current']}")

    for result in results["organic_results"]:
        print(f"Title: {result['title']}\nLink: {result['link']}\n")

    while 'next' in results['serpapi_pagination']:
        search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
        results = search.get_dict()

        print(f"Current page: {results['serpapi_pagination']['current']}")

        for result in results["organic_results"]:
            print(f"Title: {result['title']}\nLink: {result['link']}\n")


scrape()
Disclaimer: I work for SerpApi.
Upvotes: 0
Reputation: 9444
Regex is a bad idea for parsing HTML. It's cryptic to read and relies on well-formed HTML.
Try BeautifulSoup for Python. Here's an example script that returns URLs from the first 10 pages of a site:domain.com Google query.
import sys      # Used to add the BeautifulSoup folder to the import path
import urllib2  # Used to read the HTML document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, the BeautifulSoup folder sits at the level of this Python script,
    ### so I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### the "start" variable will be used to iterate through 10 pages.
    for start in range(0, 10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start * 10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Looks like Google contains URLs in <cite> tags.
        ### So for each cite tag on each page (10), print its contents (URL).
        for cite in soup.findAll('cite'):
            print cite.text
Output:
stackoverflow.com/
stackoverflow.com/questions
stackoverflow.com/unanswered
stackoverflow.com/users
meta.stackoverflow.com/
blog.stackoverflow.com/
chat.meta.stackoverflow.com/
...
Of course, you could append each result to a list so you can parse it for subdomains. I just got into Python and scraping a few days ago, but this should get you started.
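For instance, a minimal sketch of that last step (Python 3 here; the cites list is a hypothetical stand-in for the cite texts printed above, which carry no scheme, so splitting on '/' is enough):

# Hypothetical stand-in for the <cite> texts collected above.
cites = [
    'stackoverflow.com/',
    'stackoverflow.com/questions',
    'meta.stackoverflow.com/',
    'blog.stackoverflow.com/',
]

# The part before the first '/' is the (sub-)domain.
subdomains = sorted({c.split('/')[0] for c in cites})
print(subdomains)
# ['blog.stackoverflow.com', 'meta.stackoverflow.com', 'stackoverflow.com']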
Upvotes: 16
Reputation: 390
The Google Custom Search API can deliver results in Atom XML format.
See: Getting Started with Google Custom Search
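For example, a minimal sketch against the JSON variant of the Custom Search API (the API key and search-engine ID are placeholders; you create both in the Google developer console):

import requests

# Placeholders: supply your own API key and custom search engine ID (cx).
params = {
    'key': 'YOUR_API_KEY',
    'cx': 'YOUR_SEARCH_ENGINE_ID',
    'q': 'site:example.com',
}
response = requests.get('https://www.googleapis.com/customsearch/v1', params=params)

# Each organic result appears under 'items' with its URL in 'link'.
for item in response.json().get('items', []):
    print(item['link'])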
Upvotes: 3