Delecron

Reputation: 119

Web Scraping that Requires User Interaction

I'm trying to scrape a site, https://ibotta.com/rebates, that loads more items each time you scroll to the bottom of the page. It's a finite number of items, so I know it won't scroll forever, but is there any way to do this without having to interact with a browser object?

I'm trying to accomplish this in VB / VBA, but any language would do. Right now I've prototyped it in MS Access just to get a feel for how the site reacts. I can do it with the browser control loaded, but it's clunky. Preferably I'd like something I can just make an HTTP call to.

On a side note, are there any good web scraping tutorials out there I should be looking at?

Upvotes: 1

Views: 459

Answers (1)

omegastripes

Reputation: 12612

At first sight, the XHRs I examined in Chrome (Developer tools, Network tab) show that all the necessary data is located in 2 files: retailers.json (15.7 kB) and offers.json (299 kB). While you scroll down the page, no additional data is actually downloaded, so I concluded that the scripts on the page just read from those already downloaded files and add the items to the page. I checked the parameters and headers of the XHRs and created the simple VBS below, which downloads the files:

strZipCode = "11590" ' your zip code here
strPathRetailers = "C:\retailers.json" ' retailers output file path
strPathOffers = "C:\offers.json" ' offers output file path

' make XHR to retrieve initial page with X-App-Token and X-NewRelic-ID
strURL = "https://ibotta.com/rebates"
XmlHttpRequest "GET", strURL, "", "", "", strResp

' extract X-NewRelic-ID eg 'loader_config={xpid:"VQAHUlVUGwcJUlBWBQg="}'
arrTmp = Split(strResp, "loader_config={xpid:""", 2)
strTmp = arrTmp(1)
arrTmp = Split(strTmp, """}", 2)
strNewRelicID = arrTmp(0)

' extract X-App-Token eg '<meta name="ibotta-t" content="nce0dc967myuho7wco:1458857196:91bf12dcd5442cf6b2100c962c656a510738150a">'
arrTmp = Split(strResp, "<meta name=""ibotta-t"" content=""", 2)
strTmp = arrTmp(1)
arrTmp = Split(strTmp, """>", 2)
strAppToken = arrTmp(0)

' put headers to array
arrHeaders = Array( _
    Array("Accept", "application/json, text/javascript"), _
    Array("Accept-Encoding", "deflate"), _
    Array("Accept-Language", "en-US,en;q=0.5"), _
    Array("Connection", "keep-alive"), _
    Array("Host", "ibotta.com"), _
    Array("If-Modified-Since", "Thu, 1 Jan 1970 10:00:00 GMT"), _
    Array("Referer", "https://ibotta.com/rebates"), _
    Array("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0"), _
    Array("X-App-Token", strAppToken), _
    Array("X-App-Version", "3.6:webapp"), _
    Array("X-NewRelic-ID", strNewRelicID), _
    Array("X-Requested-With", "XMLHttpRequest") _
)

' make XHR to retrieve retailers
strURL = "https://ibotta.com/web_v1/retailers.json?zip=" & strZipCode
XmlHttpRequest "GET", strURL, arrHeaders, "", "", strResp
' save retailers to file
WriteTextFile strResp, strPathRetailers, -1

' make XHR to retrieve offers
strURL = "https://ibotta.com/web_v1/offers.json"
XmlHttpRequest "GET", strURL, arrHeaders, "", "", strResp
' save offers to file
WriteTextFile strResp, strPathOffers, -1

Sub XmlHttpRequest(strMethod, strURL, arrSetHeaders, strFormData, strRespHeaders, strRespText)
    Dim arrHeader
    With CreateObject("Msxml2.ServerXMLHTTP")
        .SetOption 2, 13056 ' SXH_SERVER_CERT_IGNORE_ALL_SERVER_ERRORS
        .Open strMethod, strURL, False
        If IsArray(arrSetHeaders) Then
            For Each arrHeader In arrSetHeaders
                .SetRequestHeader arrHeader(0), arrHeader(1)
            Next
        End If
        .Send strFormData
        strRespHeaders = .GetAllResponseHeaders
        strRespText = .ResponseText
    End With
End Sub

Sub WriteTextFile(strContent, strPath, lngFormat)
    ' lngFormat -2 - System default, -1 - Unicode, 0 - ASCII
    With CreateObject("Scripting.FileSystemObject").OpenTextFile(strPath, 2, True, lngFormat)
        .Write (strContent)
        .Close
    End With
End Sub

You can save this code to a text file with the .vbs extension and run it.
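
The two Split-based extractions in the script raise a "Subscript out of range" error if a marker is missing from the page, because Split then returns a one-element array. Since the question allows any language, here is a minimal Python sketch of the same extraction that returns None in that case instead. The function name extract_tokens is hypothetical, and the sample HTML is built only from the two example strings quoted in the script comments, not from a real response.

```python
import re

# Patterns mirror the markers used by the two VBS Split calls.
XPID_RE = re.compile(r'loader_config=\{xpid:"([^"]*)"\}')
TOKEN_RE = re.compile(r'<meta name="ibotta-t" content="([^"]*)">')


def extract_tokens(html):
    """Return (new_relic_id, app_token); either is None if its marker is absent."""
    xpid = XPID_RE.search(html)
    token = TOKEN_RE.search(html)
    return (
        xpid.group(1) if xpid else None,
        token.group(1) if token else None,
    )


# Sample input assembled from the example strings in the script comments.
sample = (
    'loader_config={xpid:"VQAHUlVUGwcJUlBWBQg="}'
    '<meta name="ibotta-t" content="nce0dc967myuho7wco:1458857196:'
    '91bf12dcd5442cf6b2100c962c656a510738150a">'
)
new_relic_id, app_token = extract_tokens(sample)
```

Checking both values for None before sending the follow-up requests makes it obvious when the page layout has changed.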

At the moment I can see 857 offers in total, and 220 retailers for zip code 11590 (I used JSON viewers, like the one built into Chrome, or a web service). If you want to process only the offers for zip code 11590, you have to get the list of retailer ids and keep only the offers that belong to retailers from that list.

Here is a screenshot of the retailers; each of them has an id (outlined in red):

[Screenshot: retailers]

And here is a screenshot of the offers; each of them belongs to several retailers listed in retailer_ids (also outlined in red):

[Screenshot: offers]
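
The filtering step described above (keep only the offers tied to a retailer from the zip-code list) could look roughly like this in Python. It is a sketch under assumptions: the field names id and retailer_ids come from the screenshots, but the sample records, the offer-level id field, and the function name filter_offers are made up for illustration, and the real files may wrap these lists in an outer object.

```python
def filter_offers(retailers, offers):
    """Keep offers whose retailer_ids share at least one id with retailers."""
    wanted = {r["id"] for r in retailers}
    return [o for o in offers if wanted.intersection(o.get("retailer_ids", []))]


# Made-up sample records; only "id" (retailers) and "retailer_ids" (offers)
# are field names taken from the answer.
retailers = [{"id": 1}, {"id": 5}]
offers = [
    {"id": 100, "retailer_ids": [1, 2]},
    {"id": 101, "retailer_ids": [3]},
    {"id": 102, "retailer_ids": [5, 9]},
]
matching = filter_offers(retailers, offers)
```

Building a set of retailer ids first keeps each membership check cheap, which matters little for a few hundred offers but costs nothing.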

Further processing depends on what you need. You can parse the JSON string into an object and work with it, or convert the JSON string to a Recordset to filter it.
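
If the processing happens outside VBScript, one detail to watch is that WriteTextFile is called with -1 (Unicode), so the saved files are UTF-16 with a byte order mark rather than UTF-8. A minimal Python sketch for parsing such a file into an object follows; load_json is a hypothetical helper name, and the top-level shape of the real JSON is not assumed here.

```python
import json


def load_json(path):
    """Parse a JSON file saved as UTF-16 with a BOM (FSO 'Unicode' format)."""
    # The "utf-16" codec reads the BOM to determine byte order.
    with open(path, encoding="utf-16") as f:
        return json.load(f)
```

For example, load_json(r"C:\offers.json") would return whatever structure the file holds, ready for filtering with ordinary list and dict operations.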

Upvotes: 1
