cget
cget

Reputation: 410

Getting an error trying to pull out text using Google Sheets and importxml()

I have a column of links in Google Sheets. I want to tell if a page is producing an error message using importxml

As an example, this works fine

=importxml("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_T", "//td/b")

i.e. it looks for td, and pulls out b (which are postcodes in Canada)

But this code that looks for the error message does not work:

=importxml("https://www.awwwards.com/error1/", "//div/h1" )

I want it to pull out the "THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST."

...on this page https://www.awwwards.com/error1/

I'm getting a Resource at URL not found error. What could I be doing wrong? Thanks

enter image description here

Upvotes: 0

Views: 1415

Answers (2)

player0
player0

Reputation: 1

after quick trial and error with default formulae:

=IMPORTXML("https://www.awwwards.com/error1/", "//*")

=IMPORTHTML("https://www.awwwards.com/error1/", "table", 1)

=IMPORTHTML("https://www.awwwards.com/error1/", "list", 1)

=IMPORTDATA("https://www.awwwards.com/error1/")

it seems that the website is not possible to be scraped in Google Sheets by any means (regular formulae)

Upvotes: 1

Tanaike
Tanaike

Reputation: 201388

  • You want to retrieve the value of THE PAGE YOU WERE LOOKING FOR DOESN'T EXIST. from the URL of https://www.awwwards.com/error1/.

If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.

Issue and workaround:

I think that the page of your URL is Error 404 (Not Found). So in this case, the status code of 404 is returned. I thought that by this, the built-in functions like IMPORTXML might not be able to retrieve the HTML data.

So as one workaround, how about using a custom function with UrlFetchApp? When UrlFetchApp is used, the HTML data can be retrieved even when the status code is 404.

Sample script for custom function:

Please copy and paste the following script to the script editor of the Spreadsheet. And please put =SAMPLE("https://www.awwwards.com/error1") to a cell on the Spreadsheet. By this, the script is run.

function SAMPLE(url) {
  return UrlFetchApp
    .fetch(url, {muteHttpExceptions: true})
    .getContentText()
    .match(/<h1>([\w\s\S]+)<\/h1>/)[1]
    .toUpperCase();
}

Result:

enter image description here

Note:

  • This custom function is for the URL of https://www.awwwards.com/error1. When you use this for other URL, the expected results might not be able to be retrieved. Please be careful this.

References:

If this was not the direction you want, I apologize.

Upvotes: 0

Related Questions