Getting an error trying to pull out text using Google Sheets and importxml()

I have a column of links in Google Sheets. I want to tell if a page is producing an error message using importxml

As an example, this works fine

=importxml("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T", "//td/b")

i.e. it looks for td, and pulls out b (which are postcodes in Canada)

But this code that looks for the error message does not work:

=importxml("https://www.awwwards.com/error1/", "//div/h1" )

I want it to pull out the "THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST."

...on this page https://www.awwwards.com/error1/

I'm getting a Resource at URL not found error. What could I be doing wrong? Thanks

Upvotes: 0

Answers (2)

player0

Reputation: 1

after quick trial and error with default formulae:

=IMPORTXML("https://www.awwwards.com/error1/", "//*")

=IMPORTHTML("https://www.awwwards.com/error1/", "table", 1)

=IMPORTHTML("https://www.awwwards.com/error1/", "list", 1)

=IMPORTDATA("https://www.awwwards.com/error1/")

it seems that the website is not possible to be scraped in Google Sheets by any means (regular formulae)

Upvotes: 1

Tanaike

Reputation: 201388

You want to retrieve the value of THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST. from the URL of https://www.awwwards.com/error1/.

If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.

Issue and workaround:

I think that the page of your URL is Error 404 (Not Found). So in this case, the status code of 404 is returned. I thought that by this, the built-in functions like IMPORTXML might not be able to retrieve the HTML data.

So as one workaround, how about using a custom function with UrlFetchApp? When UrlFetchApp is used, the HTML data can be retrieved even when the status code is 404.

Sample script for custom function:

Please copy and paste the following script to the script editor of the Spreadsheet. And please put =SAMPLE("https://www.awwwards.com/error1") to a cell on the Spreadsheet. By this, the script is run.

function SAMPLE(url) {
  return UrlFetchApp
    .fetch(url, {muteHttpExceptions: true})
    .getContentText()
    .match(/<h1>([\w\s\S]+)<\/h1>/)[1]
    .toUpperCase();
}

Result:

Note:

This custom function is for the URL of https://www.awwwards.com/error1. When you use this for other URL, the expected results might not be able to be retrieved. Please be careful this.

References:

Custom Functions in Google Sheets
fetch(url, params)

muteHttpExceptions: If true the fetch doesn't throw an exception if the response code indicates failure, and instead returns the HTTPResponse. The default is false.
match()
toUpperCase()

If this was not the direction you want, I apologize.