KorwinNaSloniu
KorwinNaSloniu

Reputation: 35

Webscraping prices from willhaben.at using xpath in importxml function in google-sheets

With importxml function in google-sheets i am trying to webscrape car prices from this web site: https://www.willhaben.at/iad/gebrauchtwagen/auto/gebrauchtwagenboerse?CAR_MODEL%2FMODEL=1230&CAR_MODEL%2FMAKE=1018&sort=3 I tried many options for example:

1)

//div[@class='info']/span[@class='pull-right']

But i never can get the prices to my google-sheet. Either i am getting error or empty result. I am starting to suppose that the website has some kind protection against scraping prices. Is that possible?

thanks for help,

Upvotes: 2

Views: 890

Answers (1)

qwermike
qwermike

Reputation: 1486

Don't know whether to call it protection. But the prices you want to extract are the dynamic content. So you can't easily access it with XPath.

For example, the price for the first car "Honda Jazz 1,3i-VTEC Trend" is generated by script:

<div id='e4b79dff-ede1-496a-8460-ca22374cf998'>&lt;Platzhalter&gt;</div>
<script>
    var f053143d96c794669bf0a23f42f5486ac = function(value){return base64UTF8Codec.decode(arguments[0])};
    replaceWith(document.getElementById('e4b79dff-ede1-496a-8460-ca22374cf998'), f053143d96c794669bf0a23f42f5486ac('DQogICAgICAgICAgICAgICAgPHNwYW4gY2xhc3M9InB1bGwtcmlnaHQiPiAxOS45OTAsLSA8L3NwYW4+DQogICAgICAgICAgICA='));
</script>

In the script you can see the value:

 DQogICAgICAgICAgICAgICAgPHNwYW4gY2xhc3M9InB1bGwtcmlnaHQiPiAxOS45OTAsLSA8L3NwYW4+DQogICAgICAgICAgICA=

, which is base64 encoded string. If you decode it, you will get:

<span class="pull-right"> 19.990,- </span>

, which contains the price.

So if you want to extract car prices from this web site, you probably need to write Google Apps Script.


Example of such Apps Script:

function decodeElement(encoded) {
  var bytes = Utilities.base64Decode(encoded);
  var blob = Utilities.newBlob(bytes);
  return blob.getDataAsString();
}

function extractEncodedElement(scriptStr) {
  var rgx = new RegExp(".*'([^']+)'\\)\\);");
  return rgx.exec(scriptStr)[1];
}

function extractPrice(scriptStr) {
  var encoded = extractEncodedElement(scriptStr);
  var decoded = decodeElement(encoded);
  var xml = XmlService.parse(decoded);
  return xml.getRootElement().getText();
}

function extractPrices(strs) {
  var result = [];
  for (i = 0; i < strs.length; i++) {
    result.push(extractPrice(strs[i]));
  }
  return result;
}

So after you save this script, you can use it in spreadsheet in cell:

=extractPrices(IMPORTXML("your_link_here", "//div[@class='info']/script"))

It will return you array of prices. extractPrices iterates over results from IMPORTXML and extracts price for each result.

Upvotes: 1

Related Questions