pomme

Reputation: 97

Parse HTML to retrieve specific tags value with Google Apps Script

I'm trying to parse HTML to retrieve the value of a <b> tag in my Google Apps Script code. The <b> tag contains line breaks in its attributes and appears more than once, but I only want the first value. (In this case, only 'foo' is required.)

<b class="
"
>
foo
</b><b class="
"
>
var
</b>

In Google Apps Script, functions such as 'getElementsByTagName' are not available. I first thought of using a regexp, but that is not a wise option here. Does anyone have an idea of how I can move forward? Any comment or guess would be highly appreciated!

Upvotes: 1

Views: 5879

Answers (2)

Richard Burden

Reputation: 1

Don't use the Parser library (https://www.kutil.org/2016/01/easy-data-scrapping-with-google-apps.html). It is NOT an HTML parser at all; it just looks for text between two regular expressions. If you insist on trying it anyway, you will need the new Script ID: 1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw

The ID comes from the "Completed code of Parser library" link (https://script.google.com/d/1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw/edit?usp=drive_web) on the web page https://www.kutil.org/2016/01/easy-data-scrapping-with-google-apps.html

Upvotes: -1

Tanaike

Reputation: 201613

How about using XmlService as a workaround for your situation? With XmlService, the value can be retrieved even if the tags contain several line breaks. There are probably several workarounds for your situation, so please think of this as just one of them.

The flow of the sample script is as follows.

Flow :

  1. Add an XML header and a root element tag to the HTML.
  2. Parse the created XML value using XmlService.
  3. Retrieve the value of the first tag using XmlService.

Sample script :

var html = '<b class="\n"\n>\nfoo\n</b><b class="\n"\n>\nvar\n</b>\n'; // Your sample value

var xml = '<?xml version="1.0"?><sampleContents>' + html + '</sampleContents>';
var res = XmlService.parse(xml).getRootElement().getChildren()[0].getText().trim();
Logger.log(res) // foo

Note :

  • This sample script uses your sample HTML. If your actual HTML is more complicated, can you provide it? I would like to modify the script.

If this is not what you want, please tell me. I would like to modify it.

Edit 1 :

Unfortunately, the script above cannot be used for the HTML retrieved from the URL. So I used "Parser", a GAS library, for your situation. The sample script is as follows.

Sample script :

var url = "https://www.booking.com/searchresults.ja.html?ss=kyoto&checkin_year=2018&checkin_month=10&checkin_monthday=1&checkout_year=2018&checkout_month=10&checkout_monthday=2&no_rooms=1&group_adults=1&group_children=0";
var html = UrlFetchApp.fetch(url).getContentText();
var res = Parser.data(html).from("<b class=\"\n\"\n>").to("</b>").build().trim();
Logger.log(res) // US$11

Note :

  • Before you run this script, please install the "Parser" library in your Apps Script project.
    • The project key of the library is M1lugvAXKKtUxn_vdAG9JZleS6DrsjUUV
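
If installing a library is not an option, the "text between two markers" lookup can be approximated with plain string operations. The following is only a minimal sketch, assuming literal (non-regexp) markers; `textBetween` is a hypothetical helper name and is not part of the Parser library.

```javascript
// Hypothetical helper (not part of the Parser library).
// Returns the trimmed text between the first occurrence of `from`
// and the next occurrence of `to`, or null if either marker is missing.
function textBetween(source, from, to) {
  var start = source.indexOf(from);
  if (start === -1) return null;
  start += from.length;
  var end = source.indexOf(to, start);
  if (end === -1) return null;
  return source.slice(start, end).trim();
}

// The sample value from the question.
var sampleHtml = '<b class="\n"\n>\nfoo\n</b><b class="\n"\n>\nvar\n</b>\n';
var firstValue = textBetween(sampleHtml, '<b class="\n"\n>', '</b>');
console.log(firstValue); // foo
```

Like the Parser call above, this depends on the opening marker matching the page's markup exactly, line breaks included.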

Edit 2 :

The 2nd URL in your comment is different from the 1st one, and the page it returns has no <b class=\"\n\"\n> tag, so the value you want cannot be retrieved with the script above. From the 1st URL in your comment, I guessed which value you want. Could you try the following script?

var url = "https://www.booking.com/searchresults.ja.html?ss=kyotogranvia&checkin_year=2018&checkin_month=10&checkin_monthday=1&checkout_year=2018&checkout_month=10&checkout_monthday=2&no_rooms=1&group_adults=1&group_children=0";
var html = UrlFetchApp.fetch(url).getContentText();
var res = Parser.data(html).from("<span class=\"lp-postcard-avg-price-value\">").to("</span>").build().trim();
Logger.log(res) // US$289
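
Both scripts above rely on the opening tag matching one exact string. If the attributes or line breaks vary between pages, a more tolerant lookup might be sketched with a regular expression. This is an assumption-heavy sketch and not an HTML parser: `firstTagText` is a hypothetical name, `tagName` is assumed to be a plain tag name, and nested tags of the same name or a `>` inside an attribute value are not handled.

```javascript
// Hypothetical helper: text of the first <tagName ...>...</tagName> element.
// [^>]* also matches line breaks, so attributes may span several lines.
function firstTagText(html, tagName) {
  var pattern = new RegExp(
    '<' + tagName + '\\b[^>]*>([\\s\\S]*?)</' + tagName + '\\s*>', 'i');
  var match = pattern.exec(html);
  return match ? match[1].trim() : null;
}

// The sample value from the question.
var questionHtml = '<b class="\n"\n>\nfoo\n</b><b class="\n"\n>\nvar\n</b>\n';
console.log(firstTagText(questionHtml, 'b')); // foo
```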

Upvotes: 4
