Reputation: 799
I want to create a Google script to check if a given URL is indexed by Google, so I write the following function:
function CheckURLForGoogleIndex(url, activesheet) {
  // Delete the https:// and http:// prefix
  var cururl = url.replace("https://", "");
  cururl = cururl.replace("http://", "");
  var googlesearchurl = "https://www.google.com/search?q=site:" + encodeURIComponent(cururl);
  var page = UrlFetchApp.fetch(googlesearchurl, {muteHttpExceptions: true}).getContentText();
  // Wait for 1 second before starting another fetch
  Utilities.sleep(1000);
  var number = page.match("did not match any documents");
  if (number) {
    activesheet.getSheetByName("Not Google Index").appendRow([url]);
  } else {
    activesheet.getSheetByName("Google Index").appendRow([url]);
  }
}
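For reference, I invoke the function roughly like this (a minimal example; the spreadsheet must already contain both sheets, as described in the sample input below):
function CheckAllURLs() {
  // The active spreadsheet must contain the "Google Index" and
  // "Not Google Index" sheets used by CheckURLForGoogleIndex
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  CheckURLForGoogleIndex("https://www.datanumen.com/", ss);
}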
However, when debugging the code, after invoking UrlFetchApp.fetch, I can only see the beginning (the header part) of the page variable.
I tested the function with both a Google-indexed URL and a non-indexed URL, but page.match returns null in both cases, so both URLs are put in the "Google Index" sheet.
What is the problem with my function?
Thanks
Note:
I have asked this question on https://groups.google.com/g/google-apps-script-community/c/gs1qUuKwgn4 but no one has answered, so I have to ask here.
Sample Input & Output
Input1:
url = https://www.datanumen.com/
activesheet = a Google Sheets spreadsheet which contains the sheets "Google Index" and "Not Google Index"
Expected Output1: Since https://www.datanumen.com/ is indexed by Google, it should be added to the "Google Index" sheet.
page = "<!doctype html><html lang="en"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>site:www.datanumen.com/ - Google Search…"
Input2:
url = https://www.datanumen.com/notindexedurl/
activesheet = a Google Sheets spreadsheet which contains the sheets "Google Index" and "Not Google Index"
Expected Output2: Since https://www.datanumen.com/notindexedurl/ is NOT indexed by Google, it should be added to the "Not Google Index" sheet.
page = "<!doctype html><html lang="en"><head><meta charset="UTF-8"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>site:www.datanumen.com/notindexurl/ - G…"
The problem is that currently, for both Input1 and Input2, the URL is always added to the "Google Index" sheet, since the search result never contains the "did not match any documents" text at all.
Update
I added console.log(page); and debugged again. For Input1, I get the following result:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head><meta http-equiv="content-type" content="text/html; charset=utf-8"><meta name="viewport" content="initial-scale=1"><title>https://www.google.com/search?q=site:www.datanumen.com%2F</title></head>
<body style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px;" onload="e=document.getElementById('captcha');if(e){e.focus();}">
<div style="max-width:400px;">
<hr noshade size="1" style="color:#ccc; background-color:#ccc;"><br>
<form id="captcha-form" action="index" method="post">
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
<script>var submitCallback = function(response) {document.getElementById('captcha-form').submit();};</script>
<div id="recaptcha" class="g-recaptcha" data-sitekey="6LfwuyUTAAAAAOAmoS0fdqijC2PbbdH4kjq62Y1b" data-callback="submitCallback" data-s="c5Hy4maqTFv3SzYRiWhpsqYF2isZmauUQnLVljOiED_PiaVWJWCsHMzRAZyh8HLCBHJ_mjET7yODJu8AlZ33_xGAQ8TcKuXAd7rQpsYakaGKPD8USiGSFhiII2ai-Cf_B26i1Ufpko-qYQ8V3rezhiSXxi5J2yHZ-_WwEj8ukzy5znxzVurTM_2cY243Q4ofwP7E7eWBaHIg6N3ofmPuFXd-uRIUU4z0cU_pas8"></div>
<input type='hidden' name='q' value='EgRrsuB5GKmx6oUGIhBKAdWty9nssg-nAtyy9n7hMgFy'><input type="hidden" name="continue" value="https://www.google.com/search?q=site:www.datanumen.com%2F">
</form>
<hr noshade size="1" style="color:#ccc; background-color:#ccc;">
<div style="font-size:13px;">
<b>About this page</b><br><br>
Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. <a href="#" onclick="document.getElementById('infoDiv').style.display='block';">Why did this happen?</a><br><br>
<div id="infoDiv" style="display:none; background-color:#eee; padding:10px; margin:0 0 15px 0; line-height:1.4em;">
This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the <a href="//www.google.com/policies/terms/">Terms of Service</a>. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.<br><br>This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help — a different computer using the same IP address may be responsible. <a href="//support.google.com/websearch/answer/86640">Learn more</a><br><br>Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly.
</div>
IP address: 107.178.224.121<br>Time: 2021-06-04T21:18:34Z<br>URL: https://www.google.com/search?q=site:www.datanumen.com%2F<br>
</div>
</div>
</body>
</html>
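To confirm it is a block rather than a parsing issue, I also logged the HTTP status code (a small check I added on top of the original function):
var response = UrlFetchApp.fetch(googlesearchurl, {muteHttpExceptions: true});
// With muteHttpExceptions set, the blocked request does not throw,
// so the status code can be inspected directly (429 = too many requests)
console.log(response.getResponseCode());
var page = response.getContentText();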
Upvotes: 1
Views: 2016
Reputation: 15357
Unfortunately, doing this directly by web scraping the search results with UrlFetchApp will not work. You can, however, use third-party tools to get the number of search results.
I tested this out using an exponential backoff method, which is sometimes able to get past the 429 errors that occur when a fetch request is invoked by UrlFetchApp.
When using UrlFetchApp to either web scrape or to connect to an API, it can happen that the server denies the request on the grounds of too many requests - or HTTP error 429.
Google Apps Script runs in the cloud, from a pool of IP addresses that Google owns. You can actually see all the IP ranges here. Most websites (especially large companies such as Google) have infrastructure in place to prevent bots from scraping their sites and slowing down traffic.
Sometimes it's possible to get past this error using a mixture of exponential backoff and random time intervals, as shown for the Binance API (full disclosure: this GitHub repository was written by me).
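As a rough illustration, such a wrapper might look like this (a minimal sketch of the idea, not the code from that repository; the retry count and delays are arbitrary, and options should include muteHttpExceptions: true so a 429 does not throw):
function fetchWithBackoff(url, options) {
  var maxRetries = 5;
  var response;
  for (var attempt = 0; attempt < maxRetries; attempt++) {
    response = UrlFetchApp.fetch(url, options);
    if (response.getResponseCode() !== 429) {
      return response;
    }
    // Back off exponentially (1s, 2s, 4s, ...) plus a random
    // interval of up to 1s to avoid synchronized retries
    Utilities.sleep(Math.pow(2, attempt) * 1000 + Math.floor(Math.random() * 1000));
  }
  // Still rate limited after the last attempt; return the final response
  return response;
}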
I assume that either Google directly blocks the Apps Script IP pool, or there are simply too many people trying the same thing - because with the same techniques I was unable to get any response that didn't involve entering a captcha, as discussed in the comments above and as can be seen in the log of the page string.
There are many third party APIs that you can use to do this, and I suggest searching for one that meets your needs.
I tested out one called Authoritas, which returns search engine indexing for different keywords. The API is asynchronous, so it can take up to a minute to return a response, which means a web app solution is needed.
The flow I used is as follows:
function makeApiCall(url, method, site) {
  const public_key = ""
  const private_key = ""
  const salt = ""
  let timestamp = Date.now()
  // Sign the request: HMAC-SHA256 over timestamp + public key + salt,
  // keyed with the private key and sent hex-encoded in the Authorization header
  const hash = Utilities.computeHmacSha256Signature(timestamp + public_key + salt, private_key)
  const headers = {
    "Authorization": "KeyAuth publicKey=" + public_key + " hash=" + toHexString(hash) + " ts=" + timestamp,
    "Content-Type": "application/json"
  }
  const requestParameters = {
    "search_engine": "google",
    "region": "us",
    "language": "en",
    "max_results": 100,
    "phrase": site,
    "search_type": "web",
    "user_agent": "pc",
    "parameters": {
      "priority": "standard"
    },
    "callback_type": "full",
    // Replace with the /exec URL of the deployed web app (see below)
    "callback": "script-web-app-exec-url"
  }
  const options = {
    "method": method,
    "headers": headers,
    "muteHttpExceptions": true,
    "payload": JSON.stringify(requestParameters)
  }
  const response = UrlFetchApp.fetch(url, options)
  return response
}
function toHexString(byteArray) {
  // computeHmacSha256Signature returns signed bytes (-128..127);
  // mask each to 0..255 and format it as two hex digits
  const hexString = Array.from(byteArray, function(byte) {
    return ('0' + (byte & 0xFF).toString(16)).slice(-2)
  }).join('')
  return hexString
}
And a doPost(e) function so that when the API returns data, it can be processed:
function doPost(e) {
  const jsonData = JSON.parse(e.postData.contents)
  const pages = jsonData.response.summary.pages
  const ss = SpreadsheetApp.openById("1QBzDdGn1yaUxFJciLH_Ru-BbLHuBIZTUk2UnrUShGw0")
  // An empty pages object means the phrase returned no results,
  // i.e. the URL is not indexed
  if (Object.keys(pages).length == 0) {
    ss.getSheetByName("Not Google Index").appendRow([jsonData.request.phrase])
  } else {
    ss.getSheetByName("Google Index").appendRow([jsonData.request.phrase])
  }
}
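Since doPost() only runs when Authoritas calls back, it can't be stepped through in the editor. One way to sanity-check the routing logic is with a stub payload containing only the fields the handler reads (this sample is hypothetical, not a real Authoritas response, and running it will append to the spreadsheet configured in doPost):
function testDoPost() {
  // Hypothetical minimal payload - only request.phrase and
  // response.summary.pages are read by doPost
  const sample = {
    "request": { "phrase": "asdhfdhdfgdsfser.com" },
    "response": { "summary": { "pages": {} } }
  }
  // Simulate the web app event object passed to doPost
  doPost({ postData: { contents: JSON.stringify(sample) } })
}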
From here I then published the web app with the settings:
Execute as: Me
Who has access: Anyone (not Anyone with a Google account)
Remember to then copy the web app URL when provided and paste it into the "callback": "script-web-app-exec-url" part of the payload. (Normally one could use ScriptApp.getService().getUrl(), but as per this issue, when running code from the script editor that method returns the /dev link instead of the /exec link, which will not work.)
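To avoid pasting the URL by hand after each redeploy, one option is to store it once in Script Properties and read it when building the payload (a sketch; the CALLBACK_URL property name is my own choice, and the URL shown is a placeholder):
// Run once after deploying, with the real /exec URL pasted in
function saveCallbackUrl() {
  const execUrl = "https://script.google.com/macros/s/DEPLOYMENT_ID/exec" // placeholder
  PropertiesService.getScriptProperties().setProperty("CALLBACK_URL", execUrl)
}
makeApiCall() can then set "callback" to PropertiesService.getScriptProperties().getProperty("CALLBACK_URL") instead of a hardcoded string.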
This can then be run simply like so:
function run() {
  const req = makeApiCall("https://v3.api.analyticsseo.com/serps/", "POST", "asdhfdhdfgdsfser.com")
  console.log(req.getContentText())
}
The request will run, a response from the API containing the request object will be logged, and then, when the result is ready, the Authoritas API will call the script URL you provided in the callback parameter, which will run the doPost() method.
It's a complicated workaround, but unfortunately web scraping is becoming harder and harder these days.
Upvotes: 5