Hard Scraping API

Navigate to the following URL, select Search By Country, and enter AE for Holder Country.


After you press Search, you will notice an XHR call to the following API, which is a POST request.


As you can see, there's a value for qz. I can't work out how it's generated, so I can't call the API or paginate through the results.

Does anyone have a clue how to call that API and do the pagination?

The furthest I've got is locating the JS functions that handle the encoding of the parameters, here.

I've already tried Selenium with a proxy rotation service, but I was detected after retrieving a few pages.

Upvotes: 10

Views: 886

Answers (2)

baduker

Reputation: 20052

You need to generate a wipo-visitor-uunid and pass it to the POST request as a cookie along with a bunch of other stuff.

The code that generates the wipo-visitor-uunid is this:

(function (){
    //generate unique visitor id cookie
    if (!Math.imul) Math.imul = function(opA, opB) {
        opB |= 0; 
        var result = (opA & 0x003fffff) * opB;
        if (opA & 0xffc00000) result += (opA & 0xffc00000) * opB |0;
        return result |0;
      };
    
    var _cuunid = 'wipo-visitor-uunid=';
    uunid(0);

    function uunid(force){
        if (force || document.cookie.indexOf(_cuunid)===-1){
            var value = navigator.userAgent + Date.now() + Math.random().toString().substring(2,11);
            var cookie = _cuunid + cyrb53(value) + ';expires=Jan 2 2034 00:00:00; path=/; SameSite=Lax; domain=.wipo.int';
            document.cookie = cookie;    
        } 
    }
    function cyrb53(str, seed) {
        seed = seed || 0;
        let h1 = 0xdeadbeef ^ seed, h2 = 0x8badf00d ^ seed;
        for (let i = 0, ch; i < str.length; i++) {
            ch = str.charCodeAt(i);
            h1 = Math.imul(h1 ^ ch, 2654435761);
            h2 = Math.imul(h2 ^ ch, 1597334677);
        }
        h1 = Math.imul(h1 ^ h1>>>16, 2246822507) ^ Math.imul(h2 ^ h2>>>13, 3266489909);
        h2 = Math.imul(h2 ^ h2>>>16, 2246822507) ^ Math.imul(h1 ^ h1>>>13, 3266489909);
        // return 4294967296 * (2097151 & h2) + (h1>>>0);
        return (h2>>>0).toString(16)+(h1>>>0).toString(16);
    }
}());

The wipo-visitor-uunid is valid till Jan 2 2034, so once you have it, you should be fine.
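If you'd rather generate the cookie value from Python than copy it out of a browser, here is a rough port of the cyrb53 variant shown above. It's only a sketch: the names imul, cyrb53_hex and new_visitor_uunid are mine, not the site's, and the 9 random digits merely stand in for Math.random().toString().substring(2,11). I'm also assuming the server only cares that the cookie looks like one of these hashes.

```python
import random
import time

MASK32 = 0xFFFFFFFF


def imul(a: int, b: int) -> int:
    """Low 32 bits of a * b (same bit pattern as JavaScript's Math.imul)."""
    return (a * b) & MASK32


def cyrb53_hex(text: str, seed: int = 0) -> str:
    """Port of the page's cyrb53 variant: returns hex(h2) + hex(h1), unpadded."""
    h1 = (0xDEADBEEF ^ seed) & MASK32
    h2 = (0x8BADF00D ^ seed) & MASK32
    for ch in text:
        # ord() equals charCodeAt() for characters in the Basic Multilingual
        # Plane, which covers any ordinary user-agent string.
        code = ord(ch)
        h1 = imul(h1 ^ code, 2654435761)
        h2 = imul(h2 ^ code, 1597334677)
    h1 = imul(h1 ^ (h1 >> 16), 2246822507) ^ imul(h2 ^ (h2 >> 13), 3266489909)
    # h2 is mixed with the already-updated h1, exactly as in the JavaScript.
    h2 = imul(h2 ^ (h2 >> 16), 2246822507) ^ imul(h1 ^ (h1 >> 13), 3266489909)
    return format(h2, "x") + format(h1, "x")


def new_visitor_uunid(user_agent: str) -> str:
    """Hypothetical helper: hash user agent + timestamp + random digits."""
    millis = int(time.time() * 1000)            # Date.now()
    digits = f"{random.randrange(10**9):09d}"   # assumption: 9 random digits
    return cyrb53_hex(user_agent + str(millis) + digits)


if __name__ == "__main__":
    print(new_visitor_uunid("Mozilla/5.0 (X11; Linux x86_64)"))
```

The result is a hex string of at most 16 characters, the same shape as the hard-coded value used in the request code below.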

Oh, and that qz string that you add to the POST seems to encode the query (region, results and so on), but I'm not sure how it's generated. More on that in the other answers to this question.

Here's the code, test it out on your end:

import json

import requests

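# Parentheses join the adjacent string literals; a backslash followed by
# more text on the same line is a syntax error.
query_string = (
    "qz=N4IgLgngDgpiBcIBGAnAhgOwCYgDQgBs0EQYM8QBHASxIAYBaGAOSwAUAO"
    "AMzAHY0AYgHcAWtQCuADQD2WNAQBeADyRIhCgIIBFLABlpANQIARAEIBNABIQ"
    "AVlwCi0gKoBZALwVK4mN4QBGfAB9Ej8/Og46EABfIAAA="
)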

with requests.Session() as s:
    the_cookies = s.get("https://www3.wipo.int/branddb/en/").cookies.get_dict()
    the_cookies["wipo-visitor-uunid"] = "994c22024f522fd"

    s.headers["user-agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
    s.headers["X-Requested-With"] = "XMLHttpRequest"
    s.headers["Referer"] = "https://www3.wipo.int/branddb/en/"

    end_point = f"https://www3.wipo.int/branddb/jsp/select.jsp?{query_string}"
    your_precious_data = s.post(end_point, cookies=the_cookies).json()
    print(json.dumps(your_precious_data, indent=2))

This should return an output that looks like this:

{
  "lastUpdated": 1616081900884,
  "sv": "www3.wipo.int",
  "response": {
    "docs": [
      {
        "OO": "NZ",
        "score": 1,
        "STATUS": "PEND",
        "MTY": [
          "Word"
        ],
        "AD": "2021-03-17T23:59:59Z",
        "HOL": [
          "PONSONBY DOGS LIMITED"
        ],
        "NC": [
          43
        ],
        "SOURCE": "NZTM",
        "DOC": "36/03/1173603_20210317.1919.xml.gz",
        "ID": "NZTM.1173603",
        "BRAND": [
          "Good Dog"
        ],
        "HOLC": [
          "NZ"
        ]
      },
and much, much more data ...
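Once you have that JSON, pulling out the fields you care about is straightforward. Here's a small sketch that works on a trimmed copy of the sample above; summarize is a hypothetical helper of mine, and I'm assuming every record keeps the same ID / BRAND / HOL / STATUS keys, which I haven't verified across the whole database.

```python
import json

# Trimmed copy of the sample response shown above.
sample = json.loads("""
{
  "lastUpdated": 1616081900884,
  "sv": "www3.wipo.int",
  "response": {
    "docs": [
      {
        "OO": "NZ",
        "STATUS": "PEND",
        "AD": "2021-03-17T23:59:59Z",
        "HOL": ["PONSONBY DOGS LIMITED"],
        "ID": "NZTM.1173603",
        "BRAND": ["Good Dog"],
        "HOLC": ["NZ"]
      }
    ]
  }
}
""")


def summarize(payload: dict) -> list:
    """Flatten each doc into a small dict; missing keys become None or ''."""
    rows = []
    for doc in payload.get("response", {}).get("docs", []):
        rows.append({
            "id": doc.get("ID"),
            "brand": ", ".join(doc.get("BRAND", [])),
            "holder": ", ".join(doc.get("HOL", [])),
            "status": doc.get("STATUS"),
        })
    return rows


rows = summarize(sample)
for row in rows:
    print(row)
```

Using .get() everywhere keeps the loop from blowing up on records that lack one of the list fields.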

Upvotes: 7

user15398259

The qz value is JSON "encoded" with LZString.compressToBase64.


The qi value seems to be initially taken from qk in the source HTML, with 0- prepended to it.

var qk = "ooooooooooooooooooo";

// if(!(w == 790 && (h == 600 || h == 590))) 

qk = "yj0IAlhpQGl9BLWmmmJ2WMuzofkYFis64bmU5/6mE8w=";

Certain requests require that leading number to be incremented after you make them.

You also need the cookie given in the other answer.
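The qk / qi handling described above could be sketched roughly like this. The helper names extract_qk and make_qi are made up for illustration. I'm assuming the last qk assignment in the page source is the one in effect, and that "incremented" means going from 0- to 1- and so on; I don't know exactly which requests need the bump.

```python
import re

# Matches both `var qk = "..."` and the later plain `qk = "..."` assignment.
QK_PATTERN = re.compile(r'qk\s*=\s*"([^"]*)"')


def extract_qk(html: str) -> str:
    """Return the value of the last qk assignment found in the page source."""
    matches = QK_PATTERN.findall(html)
    if not matches:
        raise ValueError("no qk assignment found")
    return matches[-1]


def make_qi(counter: int, qk: str) -> str:
    """Build a qi value: the request counter, a dash, then qk."""
    return f"{counter}-{qk}"


# The snippet from the source HTML quoted above.
page_source = '''
var qk = "ooooooooooooooooooo";
// if(!(w == 790 && (h == 600 || h == 590)))
qk = "yj0IAlhpQGl9BLWmmmJ2WMuzofkYFis64bmU5/6mE8w=";
'''

qk = extract_qk(page_source)
first_qi = make_qi(0, qk)   # initial value
next_qi = make_qi(1, qk)    # assumed value after one increment
print(first_qi)
print(next_qi)
```

In a real session you would fetch the page with the same requests.Session that carries the cookie, run extract_qk on the response text, and keep the counter alongside the session.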

Upvotes: 16
