Reputation: 1133
I'm trying to get the elements from a web page into a Google spreadsheet using:
function pegarAsCoisas() {
  var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
  var elements = XmlService.parse(html);
}
However, I keep getting the error:
Error on line 2: Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character. (line 4, file "")
How do I solve this? I want to get the H1 text from this site, but for other sites I'll have to select other elements.
I know the method XmlService.parse(html) works for other sites, like Wikipedia, as you can see here.
Upvotes: 2
Views: 2438
Reputation: 38210
Technically HTML and XHTML are not the same. See What are the main differences between XHTML and HTML?
Regarding the OP code, the following works just fine:
function pegarAsCoisas() {
  var html = UrlFetchApp
    .fetch('http://www.saosilvestre.com.br')
    .getContentText();
  Logger.log(html);
}
As previous answers said, XmlService should not be used directly on the content returned by UrlFetchApp for this page. You could first convert the page source from HTML to XHTML so that the XML Service (XmlService) can parse it, use the older Xml service, which could work directly with HTML pages, or handle the page source directly as text.
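As a rough illustration of the last option (treating the page source as plain text), here is a minimal sketch. The helper name getAllTagTexts is hypothetical and not part of Apps Script; it uses only standard JavaScript regular expressions, and it assumes the elements of interest are not nested inside elements of the same name.

```javascript
// Hypothetical helper: collects the text of every <tagName> element in an HTML string.
function getAllTagTexts(html, tagName) {
  // [\s\S]*? matches lazily across line breaks; 'gi' finds every match, case-insensitively.
  var pattern = new RegExp(
    '<' + tagName + '\\b[^>]*>([\\s\\S]*?)</' + tagName + '\\s*>', 'gi');
  var texts = [];
  var match;
  while ((match = pattern.exec(html)) !== null) {
    // Drop nested tags such as <a> or <span>, then collapse runs of whitespace.
    var text = match[1].replace(/<[^>]+>/g, '').replace(/\s+/g, ' ').trim();
    texts.push(text);
  }
  return texts;
}
```

In Apps Script, the html argument would be the string returned by UrlFetchApp.fetch(url).getContentText(), and the tag name can be changed per site. A regular expression is not a real HTML parser, so this is only suitable for simple, predictable pages.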
Upvotes: 1
Reputation: 2196
XmlService works only with 100% well-formed XML content; it is not error tolerant. Google Apps Script used to have a tolerant service called Xml, but it was deprecated. However, it still works, and you can use it instead, as explained here: GAS-XML
Upvotes: 1
Reputation: 1
Try replacing itemscope with itemscope = '':
function pegarAsCoisas() {
  var html = UrlFetchApp.fetch("http://www.saosilvestre.com.br").getContentText();
  // replace() is a String method, so it must be called on html.
  html = html.replace("itemscope", "itemscope = ''");
  var elements = XmlService.parse(html);
}
For more information, look here.
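The same idea can be generalized to any attribute written without a value, not only itemscope. The sketch below is an assumption-laden illustration: fixBareAttributes is a hypothetical helper, it uses only standard JavaScript string and RegExp methods, and it does not skip script blocks or comments, so text such as `a<b && c>d` inside a script could be altered by mistake.

```javascript
// Hypothetical helper: rewrites valueless attributes (e.g. itemscope) as name="".
function fixBareAttributes(html) {
  // A start tag: the name, then everything up to '>' (quoted values may contain '>').
  var tagPattern = /<([A-Za-z][\w:-]*)((?:"[^"]*"|'[^']*'|[^'">])*)>/g;
  // One attribute: a name, optionally followed by = and a quoted or unquoted value.
  var attrPattern = /([^\s"'<>\/=]+)(\s*=\s*(?:"[^"]*"|'[^']*'|[^\s"'>]+))?/g;
  return html.replace(tagPattern, function (whole, name, attrs) {
    var fixed = attrs.replace(attrPattern, function (match, attrName, value) {
      // Keep attributes that already have a value; give the others an empty one.
      return value ? match : attrName + '=""';
    });
    return '<' + name + fixed + '>';
  });
}
```

This only addresses the specific error in the question. A real page may still fail in XmlService.parse for other reasons, such as unclosed tags like `<br>` or HTML-only entities like `&nbsp;`.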
Upvotes: -1
Reputation: 31300
The HTML isn't XML, and you don't need to try to parse it. You can use string methods instead:
function pegarAsCoisas() {
  var urlFetchReturn = UrlFetchApp.fetch("http://www.saosilvestre.com.br");
  var html = urlFetchReturn.getContentText();
  Logger.log('html.length: ' + html.length);
  // Locate the opening <h1 tag and the first closing </h1> after it.
  var index_OfH1 = html.indexOf('<h1');
  var endingH1 = html.indexOf('</h1>', index_OfH1);
  Logger.log('index_OfH1: ' + index_OfH1);
  Logger.log('endingH1: ' + endingH1);
  // Take everything from '<h1' up to '</h1>', then drop the rest of the opening tag.
  var h1Content = html.slice(index_OfH1, endingH1);
  h1Content = h1Content.slice(h1Content.indexOf(">") + 1);
  Logger.log('h1Content: ' + h1Content);
}
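Since the question mentions selecting other elements on other sites, the string approach above could be wrapped in a reusable function. This is only a sketch: getFirstTagContent is a hypothetical name, it matches lowercase tag names exactly as written, and it returns null instead of slicing with -1 when the tag is missing.

```javascript
// Hypothetical helper: returns the inner HTML of the first <tagName> element, or null.
function getFirstTagContent(html, tagName) {
  var needle = '<' + tagName;
  var open = html.indexOf(needle);
  // Skip longer names that merely start with tagName, e.g. <pre> when looking for <p>.
  while (open !== -1 && !/[\s>\/]/.test(html.charAt(open + needle.length))) {
    open = html.indexOf(needle, open + 1);
  }
  if (open === -1) return null;
  // End of the opening tag, then the first matching closing tag after it.
  var openEnd = html.indexOf('>', open);
  if (openEnd === -1) return null;
  var close = html.indexOf('</' + tagName + '>', openEnd);
  if (close === -1) return null;
  return html.slice(openEnd + 1, close).trim();
}
```

Inside pegarAsCoisas, this could be called as getFirstTagContent(html, 'h1'), and with a different tag name for other sites.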
Upvotes: 2