Mislav Sagovac

Reputation: 195

Web scrape table from site

I want to scrape one table from the following website: https://www.katastar.hr

To see what I mean, open the browser's developer tools (Inspect) and click the Network tab. When you open the site, you can see a request to this URL: https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined

The problem is that id and status are different every time you open the site. How can I scrape the output of the above request (JSON that represents a table) when the GET query parameters change every time?

I would give a reproducible example, but there is not much I can try. I assume I should start from the home page, but I don't know how to proceed:

headers <- c(
  "Accept" = 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
  'Accept-Encoding' = "gzip, deflate, br",
  'Accept-Language' = 'hr-HR,hr;q=0.9,en-US;q=0.8,en;q=0.7',
  "Cache-Control" = "max-age=0",
  "Connection" = "keep-alive",
  "DNT" = "1",
  "Host" = "www.katastar.hr",
  "If-Modified-Since" = "Mon, 22 Mar 2021 13:39:38 GMT",
  "Referer" = "https://www.google.com/",
  "sec-ch-ua" = '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
  "sec-ch-ua-mobile" = "?0",
  "Sec-Fetch-Dest" = "document",
  "Sec-Fetch-Mode" = "navigate",
  "Sec-Fetch-Site" = "same-origin",
  "Sec-Fetch-User" = "?1",
  "Upgrade-Insecure-Requests" = "1",
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
)
p <- httr::GET(
  "https://www.katastar.hr/",
  httr::add_headers(.headers = headers))
httr::cookies(p)

The code can be in either R or Python.

Upvotes: 2

Views: 231

Answers (1)

Bertrand Martel

Reputation: 45513

You just need to send the HTTP header Origin to make it work:

  • python
import requests

r = requests.get(
    "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
    headers={
        "Origin": "https://www.katastar.hr"
    })

print(r.json())

repl.it: https://replit.com/@bertrandmartel/ScrapeKatastar

  • R
library(httr)

data <- content(GET(
  "https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position?id=2432593&status=1332094865186&x=undefined&y=undefined",
  add_headers(origin = "https://www.katastar.hr")
  ), as = "parsed", type = "application/json")

print(data)

To go a little further, here is the JS code the website uses to generate id and status:

e.prototype.getSurveyors = function(e) {
    var t = this.runbase(),
      n = this.create(t.toString(), null);
    return this.httpClient.get(s + "/position", {
      params: {
        id: t.toString(),
        status: n,
        x: String(e[0]),
        y: String(e[1])
      }
    })
}
e.prototype.runbase = function() {
    return Math.floor(1e7 * Math.random())
}
e.prototype.create = function(e, t) {
    for (var n = 0, i = 0; i < e.length; i++) n = (n << 5) - n + e.charAt(i).charCodeAt(0), n &= n;
    return null == t && (t = e), Math.abs(n).toString().substring(0, 6) + (Number(t) << 1)
}

It picks a random number as id, encodes it (the first six digits of the absolute value of a 32-bit string hash of id, followed by the input value shifted left by one bit), and puts the result into the status field. The server then checks whether the encoded status value matches the id value.

Previously used id values still seem to work, as in the sample above (when no data is sent), but you can also reproduce the JS function above like this (example in Python):

from random import randint
import ctypes
import requests

number = randint(1000000, 9999999)

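def encode(rand, data):
    # Port of the site's JS `create`: 32-bit hash of the id string plus a shifted suffix
    randStr = str(rand)
    n = 0
    for char in randStr:
        # ctypes.c_int truncates to a signed 32-bit int, like the JS bitwise operators
        n = ctypes.c_int(ctypes.c_int(n << 5).value - n + ord(char)).value
    if data is None:
        # with no payload, the id itself is shifted
        suffix = ctypes.c_int(rand << 1).value
    else:
        suffix = ctypes.c_int(data << 1).value
    return f"{str(abs(n))[:6]}{suffix}"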

r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/lrInstitutions/position",
                 params={
                     "id": number,
                     "status": encode(number, None)
                 },
                 headers={
                     "Origin": "https://www.katastar.hr"
                 })
print(r.json())

# GET parcel Id 13241901
parcelId = 13241901
r = requests.get("https://oss.uredjenazemlja.hr/rest/katHr/parcelInfo",
                 params={
                     "id": number,
                     "status": encode(number, parcelId)
                 },
                 headers={
                     "Origin": "https://www.katastar.hr"
                 })
print(r.json())

repl.it: https://replit.com/@bertrandmartel/ScrapeKatastarDecode
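
As an offline sanity check of the port above, here is a minimal sketch that reimplements the same encoding with plain bit masks instead of ctypes and compares the result with the id and status pair from the URL in the question. The names to_int32 and make_status are illustrative only; they do not come from the site or from the code above.

```python
def to_int32(value: int) -> int:
    """Wrap a Python int to a signed 32-bit integer, as JS bitwise operators do."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def make_status(request_id: int, data=None) -> str:
    """Illustrative pure-Python port of the site's `create` function."""
    n = 0
    for char in str(request_id):
        # (n << 5) - n equals 31 * n; the JS `n &= n` truncates to 32 bits each round
        n = to_int32(to_int32(n << 5) - n + ord(char))
    # with no payload, the id itself is shifted left by one bit
    suffix = to_int32((request_id if data is None else data) << 1)
    return f"{str(abs(n))[:6]}{suffix}"


# id and status copied from the URL in the question
print(make_status(2432593))  # 1332094865186
```

For id 2432593 the hash is -1332098482, so the prefix is 133209, and 2432593 << 1 is 4865186, which together give the status value 1332094865186 seen in the question's URL.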

Upvotes: 2
