Lalit Patel
Lalit Patel

Reputation: 115

Web scraping using VBA

I would like to extract data from this URL.

I want to extract Title, mobile contact number and address from each of 10 business cards.

enter image description here

Here is some code I tried but did not get success.

Public Sub GetValueFromBrowser()
    On Error Resume Next
    Dim Sn As Integer
    Dim ie As Object
    Dim url As String
    Dim Doc As HTMLDocument
    Dim element As IHTMLElement
    Dim elements As IHTMLElementCollection

    For Sn = 1 To 1

        url = Sheets("Infos").Range("C" & Sn).Value

        Set ie = CreateObject("InternetExplorer.Application")

        With ie
            .Visible = 0
            .navigate url
            While .Busy Or .readyState <> 4
                DoEvents
            Wend
        End With    

        Set Doc = ie.document
        Set elements = Doc.getElementsByClassName(" col-sm-5 col-xs-8 store-details sp-detail paddingR0")

        Dim count As Long
        Dim erow As Long
        count = 0
        For Each element In elements
            If element.className = "lng_cont_name" Then
                erow = Sheet1.Cells(Rows.count, 1).End(xlUp).Offset(1, 0).Row
                Cells(erow, 1) = Doc.getElementsByClassName("Store-Name")(count).innerText
                Cells(erow, 2) = Doc.getElementsByClassName("cont_fl_addr")(count).innerText
                count = count + 1
            End If
        Next element

        If Val(Left(Sn, 2)) = 99 Then
            ActiveWorkbook.Save
        End If

    Next Sn
End Sub

Upvotes: 1

Views: 547

Answers (1)

QHarr
QHarr

Reputation: 84465

The tel numbers were not easy as I think they have purposefully been made difficult to scrape. I have found a way to decipher the values from the CSS pseudo ::before element content. The addresses and titles were straightforward CSS selections.


I have since written a cleaner script in python here.


So, how do the various parts of the code work?

titles:

Set titles = .querySelectorAll(".jcn [title]")

I target the titles as elements that have a title attribute with a parent jcn class attribute. The "." indicates a class selector, the "[]" an attribute selector, and the " " in between is a descendant combinator.

enter image description here

querySelectorAll method of document returns a nodeList of all matching elements on the page i.e. the 10 titles.


addresses:

Set addresses = .querySelectorAll(".desk-add.jaddt")

Addresses are targeted by their class attribute desk-add jaddt. As compound class names are not allowed, an additional "." has to be replace the white space in the name.

enter image description here


Telephone numbers (via deciphering contents within storesTextToDecipher) :

Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")

This is where the magic happens. The numbers are not available via the DOM directly as they are pseudo element content.

If you inspect the relevant HTML you will find a series of pseudo ::before elements. VBA exposes no mechanism for applying pseudo selectors to try and get at this info in the CSS for the page.

What you see is in fact a series of span elements that each have a class attribute beginning with mobilesv. These elements sit within a single parent element of class col-sm-5 col-xs-8 store-details sp-detail paddingR0 (note again the compound class name).

I initially gather a nodeList of all the parent elements.

Sample of returned elements:

Each of these parent elements houses the class name (beginning with mobilesv) elements that constitute the characters of the telephone number string. Some characters are numbers in the string, others represent the +()-. Note: The 2|3 letter strings, in the class names, after icon- e.g. dc, fe.

For example, the first search result on the page, for the initial number 9 in the telephone number:

enter image description here

The actual CSS content for this pseudo element /telephone character can be observed in the CSS styling:

enter image description here

Note the class name and before pseudo element selector: .icon-ji:before And the content is \9d010.

Long story short.... you can extract the 2 or 3 letters after icon- i.e. ji in this case, and the number string after \9d0, i.e. 10 in this case, and use these two bits of info to decipher the telephone number. This info is available in the response:

See the same 2/3 letter strings that are associated with the class names of the telephone string on the left, and the content instructions on the right. A little maths deduces that the number on the right is 1 greater than the telephone number, for that class, shown in the image of the webpage. I simply create a dictionary that then maps the 2/3 letter abbreviation to the telephone number by parsing this section of the html.

When looping over storesTextToDecipher, I use this dictionary to decipher the actual telephone number from the matching 2/3 letter abbreviation in the class name.


VBA:

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, html As HTMLDocument
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
  
    Dim cipherKey As String, cipherDict As Object
    Set cipherDict = CreateObject("Scripting.Dictionary")
    cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
    cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
    
    Dim arr() As String, tempArr() As String, i As Long, j As Long
    arr = Split(cipherKey, """}.icon-")
    For i = LBound(arr) To UBound(arr)
        tempArr = Split(arr(i), Chr$(32))
        cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
    Next

    html.body.innerHTML = sResponse
    Dim titles As Object, addresses As Object, storesTextToDecipher As Object
    With html
        Set titles = .querySelectorAll(".jcn [title]")
        Set addresses = .querySelectorAll(".desk-add.jaddt")
        Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
    End With
    
    For i = 0 To titles.Length - 1
        Debug.Print "title: " & titles.item(i).innerText
        Debug.Print "address: " & addresses.item(i).innerText
        Debug.Print GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
    Next
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
    Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
    Set html2 = New HTMLDocument
    html2.body.innerHTML = storeInfo.innerHTML
    Set elems = html2.querySelectorAll("b span")
    For j = 0 To elems.Length - 1
        On Error Resume Next
        If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
            telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
        End If
        On Error GoTo 0
    Next
   GetStoreNumber = telNumber
End Function

Sample output:

enter image description here


Edit: All page results

As you now want more than 10 the following uses the expected page result count (NUMBER_RESULTS_ON_PAGE) to gather the information from the page. It scrolls the page until the expected number of telephone numbers (which should be unique) are found, or the MAX_WAIT_SEC is hit. This means you avoid an infinite loop and can set your expected result count if you expect a different number. It does rely on each store having a telephone number listed - this seems a fairly reasonable assumption.

Option Explicit
Public Sub GetInfo()
    Dim IE As New InternetExplorer, resultCountDict As Object, cipherDict As Object, t As Date
    Const MAX_WAIT_SEC As Long = 300 'wait 5 minutes max before exiting loop to get all results
    Const NUMBER_RESULTS_ON_PAGE As Long = 80
    Const URL = "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3"
    
    Dim titles As Object, addresses As Object, storesTextToDecipher As Object
    
    Application.ScreenUpdating = True
    
    Set resultCountDict = CreateObject("Scripting.Dictionary")
    Set cipherDict = GetCipherDict(URL)
    
    With IE
        .Visible = True
        .Navigate2 URL
       
        While .Busy Or .readyState < 4: DoEvents: Wend
        
        With .document
            t = Timer
            Do
                DoEvents
                Set titles = .querySelectorAll(".jcn [title]")
                Set addresses = .querySelectorAll(".desk-add.jaddt")
                Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
                Dim telNumber As String, i As Long
                       
                For i = 0 To titles.Length - 1
                    telNumber = GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
                    If Not resultCountDict.Exists(telNumber) Then
                        resultCountDict.Add telNumber, Array(titles.item(i).innerText, addresses.item(i).innerText, telNumber)
                    End If
                Next
            
                .parentWindow.execScript "window.scrollBy(0, window.innerHeight);", "javascript"
                
                While IE.Busy Or IE.readyState < 4: DoEvents: Wend
                If Timer - t > MAX_WAIT_SEC Then Exit Do
            Loop Until resultCountDict.Count = NUMBER_RESULTS_ON_PAGE

        End With
        .Quit
    End With
    
    Dim key As Variant, rowCounter As Long
    rowCounter = 1
    With ThisWorkbook.Worksheets("Sheet1")
        For Each key In resultCountDict.keys
            .Cells(rowCounter, 1).Resize(1, 3) = resultCountDict(key)
            rowCounter = rowCounter + 1
        Next
    End With
    Application.ScreenUpdating = True
End Sub

Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
    Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
    Set html2 = New HTMLDocument
    html2.body.innerHTML = storeInfo.innerHTML
    Set elems = html2.querySelectorAll("b span")
    For j = 0 To elems.Length - 1
        On Error Resume Next
        If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
            telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
        End If
        On Error GoTo 0
    Next
   GetStoreNumber = telNumber
End Function

Public Function GetCipherDict(ByVal URL As String) As Object
    Dim sResponse As String, html As HTMLDocument
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    Dim cipherKey As String, cipherDict As Object
    Set cipherDict = CreateObject("Scripting.Dictionary")
    cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
    cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))

    Dim arr() As String, tempArr() As String, i As Long, j As Long
    arr = Split(cipherKey, """}.icon-")
    For i = LBound(arr) To UBound(arr)
        tempArr = Split(arr(i), Chr$(32))
        cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
    Next
    Set GetCipherDict = cipherDict
End Function

EDIT:

Version for where more than one number is present at top (Please note that if you make too many requests or too quickly server will serve you random pages):

Option Explicit

Public Sub GetDetails()
    Dim re As Object, decodeDict As Object, i As Long
    Dim html As MSHTML.htmlDocument, responseText As String, keys(), values()
    
    Set decodeDict = CreateObject("Scripting.Dictionary")
    Set re = CreateObject("vbscript.regexp")
    Set html = New MSHTML.htmlDocument
    
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.justdial.com/chengalpattu/Oasis-Pharma-Near-Saraswathi-Children-School-Revathypuram-Urapakkam/9999PXX44-XX44-181123145524-X8G7_BZDET", False
        .setRequestHeader "User-Agent", "Mozilla/4.0"
        .send
        responseText = .responseText
        html.body.innerHTML = responseText
    End With
    
    keys = GetMatches(re, responseText, "-(\w+):before")

    If UBound(keys) = 0 Then Exit Sub
    
    values = GetMatches(re, responseText, "9d0(\d+)", True)
   
    For i = LBound(values) To UBound(values)
        decodeDict(keys(i)) = values(i)
    Next
    
    Dim itemsToDecode()
    
    decodeDict(keys(UBound(keys))) = "+"

    itemsToDecode = GetValuesToDecode(html)
    
    PrintNumbers re, html, itemsToDecode, decodeDict
End Sub

Public Function GetMatches(ByVal re As Object, ByVal inputString As String, ByVal sPattern As String, Optional ByVal numeric = False, Optional ByVal spanSearch = False) As Variant
    Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long

    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern

        If .Test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                If numeric Then
                    arrMatches(i) = iMatch.SubMatches.item(0) - 1
                Else
                    If spanSearch Then
                        arrMatches(i) = iMatch
                    Else
                        arrMatches(i) = iMatch.SubMatches.item(0)
                    End If
                End If
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

Public Function GetValuesToDecode(ByVal html As MSHTML.htmlDocument) As Variant
    Dim i As Long, elements As Object, results(), Class As String

    Set elements = html.querySelectorAll(".telCntct span[class*='icon']")
    
    ReDim results(elements.Length - 1)
    For i = 0 To elements.Length - 1
        Class = elements.item(i).className
        results(i) = Right$(Class, Len(Class) - InStrRev(Class, "-"))
    Next
    GetValuesToDecode = results
End Function

Public Sub PrintNumbers(ByVal re As Object, ByVal html As htmlDocument, ByVal itemsToDecode As Variant, ByVal decodeDict As Object)
    Dim output As String, i As Long

    For i = LBound(itemsToDecode) To UBound(itemsToDecode)
        output = output & decodeDict(itemsToDecode(i))
    Next
    
    Dim htmlToSearch As String, groups As Variant, startPos As Long, oldStartPos As Long
    
    htmlToSearch = html.querySelector(".telCntct").outerHTML

    groups = GetMatches(re, htmlToSearch, "mobilesv|,", False, True)
 
    startPos = 1
    
    Dim totalNumbers As Long
    
    For i = LBound(groups) To UBound(groups)
        If InStr(groups(i), ",") > 0 Then
            totalNumbers = totalNumbers + 1
            Debug.Print Mid$(output, startPos, IIf(startPos = 1, i, i - startPos))
            startPos = i + 1
        End If
    Next
    If totalNumbers = 1 Then Debug.Print Right$(output, Len(output) - startPos - 1)
End Sub

Upvotes: 3

Related Questions