How to solve error while parsing HTML

I'm trying to get the elements from a web page in a Google spreadsheet using:

function pegarAsCoisas() {
  var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
  var elements = XmlService.parse(html);                 
}

However, I keep getting the error:

Error on line 2: Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character. (line 4, file "")

How do I solve this? I want to get the H1 text from this site, but for other sites I'll have to select other elements.

I know that XmlService.parse(html) works for other sites, such as Wikipedia, as you can see here.

Upvotes: 2

Answers (4)

Wicket

Reputation: 38210

Technically HTML and XHTML are not the same. See What are the main differences between XHTML and HTML?

Regarding the OP's code, the following works just fine:

function pegarAsCoisas() {
  var html =  UrlFetchApp
    .fetch('http://www.saosilvestre.com.br')
    .getContentText();
  Logger.log(html);
}

As previous answers said, you should use something other than calling XmlService directly on the text returned by UrlFetchApp. You could first convert the page source from HTML to XHTML so that the XML Service (XmlService) can parse it, use the older Xml service, which could work directly with HTML pages, or handle the page source directly as text.

Upvotes: 1

Sujay Phadke

Reputation: 2196

The XmlService service works only with 100% correct XML content. It is not error tolerant. Google Apps Script used to have a tolerant service called the Xml service, but it was deprecated. However, it still works and you can use it instead, as explained here: GAS-XML

Upvotes: 1

Diego Gomes

Reputation: 1

Try replacing itemscope with itemscope = '':

function pegarAsCoisas() {
  var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
  // replace() must be called on the string; a string pattern replaces only the first match
  html = html.replace("itemscope", "itemscope = ''");
  var elements = XmlService.parse(html);                 
}

For more information, look here.
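
A plain string replace() changes only the first match, so a more general version of this workaround could give every valueless occurrence of an attribute an empty value. The following is only a sketch: addEmptyAttributeValue is a hypothetical helper, not part of Apps Script. It assumes simple attribute names, and it will also touch matching words in the page text. Other non-XML constructs (unclosed tags, for instance) can still make XmlService.parse fail.

```javascript
// Hypothetical helper (not an Apps Script API): gives a valueless attribute
// such as `itemscope` an empty value so an XML parser accepts it.
// Assumes attrName contains only letters, digits, or hyphens.
function addEmptyAttributeValue(html, attrName) {
  // Match the name after whitespace, unless it is part of a longer name
  // or is already followed by "=".
  var pattern = new RegExp('(\\s)' + attrName + '(?![\\w-]|\\s*=)', 'g');
  return html.replace(pattern, function (match, leadingSpace) {
    return leadingSpace + attrName + '=""';
  });
}

var sample = '<html itemscope itemtype="http://schema.org/WebPage">';
var fixed = addEmptyAttributeValue(sample, 'itemscope');
// fixed is '<html itemscope="" itemtype="http://schema.org/WebPage">'
```

In the script above, the result of addEmptyAttributeValue(html, 'itemscope') would be passed to XmlService.parse in place of the raw html.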

Upvotes: -1

Alan Wells

Reputation: 31300

The HTML isn't XML, and you don't need to try to parse it. You need to use string methods:

function pegarAsCoisas() {

  var urlFetchReturn = UrlFetchApp.fetch("http://www.saosilvestre.com.br");
  var html = urlFetchReturn.getContentText();

  Logger.log('html.length: ' + html.length);

  var index_OfH1 = html.indexOf('<h1');
  var endingH1 = html.indexOf('</h1>');

  Logger.log('index_OfH1: ' + index_OfH1);
  Logger.log('endingH1: ' + endingH1);

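  // Slice from the start of the opening <h1 tag up to the closing </h1>
  var h1Content = html.slice(index_OfH1, endingH1);
  // Drop the rest of the opening tag, keeping only the text after its ">"
  h1Content = h1Content.slice(h1Content.indexOf(">") + 1);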

  Logger.log('h1Content: ' + h1Content);

}
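
Because other sites will need other elements, the same idea can be wrapped in a reusable function. This is a sketch only: extractFirstTagText is a hypothetical name, and it uses a regular expression rather than indexOf. It assumes the opening tag has no ">" inside an attribute value and that the element is not nested inside another element of the same name.

```javascript
// Hypothetical helper: returns the text of the first <tagName> element,
// or null when no such element is found.
function extractFirstTagText(html, tagName) {
  // Opening tag (with optional attributes), lazily captured content, closing tag.
  var pattern = new RegExp(
    '<' + tagName + '(?:\\s[^>]*)?>([\\s\\S]*?)</' + tagName + '\\s*>',
    'i'
  );
  var match = pattern.exec(html);
  if (match === null) {
    return null;
  }
  // Remove any nested tags and surrounding whitespace from the content.
  return match[1].replace(/<[^>]*>/g, '').trim();
}

var sampleHtml =
  '<html><head><title>Race</title></head>' +
  '<body><h1 class="main">Hello <b>World</b></h1></body></html>';
var h1Text = extractFirstTagText(sampleHtml, 'h1'); // "Hello World"
```

In Apps Script, the html argument would be the string returned by UrlFetchApp.fetch(url).getContentText().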

Upvotes: 2
