Martin Dreher
Martin Dreher

Reputation: 1564

VBA webscraper - Return InnerHTML with regex

Using Excel VBA, i have to scrape some data from this website.

Since the relevant website objects dont contain an id, I cannot use HTML.Document.GetElementById.

However, I noticed that the relevant information is always stored in a <div>-section like the following:

<div style="padding:7px 12px">Basler Versicherung AG &#214;zmen</div>

Question: Is it possible to construct a RegExp that, probably in a Loop, returns the contents inside <div style="padding:7px 12px"> and the next </div>?

What I have so far is the complete InnerHtml of the container, obviously I need to add some code to loop over the yet-to-be-constructed RegExp.

Private Function GetInnerHTML(url As String) As String
    Dim i As Long
    Dim Doc As Object
    Dim objElement As Object
    Dim objCollection As Object

On Error GoTo catch
   'Internet Explorer Object is already assigned
   With ie
        .Navigate url
        While .Busy
            DoEvents
        Wend
        GetInnerHTML = .document.getelementbyId("cphContent_sectionCoreProperties").innerHTML
    End With
    Exit Function
catch:
    GetInnerHTML = Err.Number & " " & Err.Description
End Function

Upvotes: 2

Views: 687

Answers (2)

SIM
SIM

Reputation: 22440

Another way you can achieve the same using XMLHTTP request method. Give it a go:

Sub Fetch_Data()
    Dim S$, I&

    With New XMLHTTP60
        .Open "GET", "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649", False
        .send
        S = .responseText
    End With

    With New HTMLDocument
        .body.innerHTML = S
        With .querySelectorAll("#cphContent_sectionCoreProperties label[id^='cphContent_ct']")
            For I = 0 To .Length - 1
                Cells(I + 1, 1) = .Item(I).innerText
                Cells(I + 1, 2) = .Item(I).NextSibling.FirstChild.innerText
            Next I
        End With
    End With
End Sub

Reference to add to the library before executing the above script:

Microsoft HTML Object Library
Microsoft XML, V6.0

Upvotes: 2

Ryan Wildry
Ryan Wildry

Reputation: 5677

I don't think you need Regular expressions to find the content on the page. You can use the relative positions of the elements to find the content I believe you are after.

Code

Option Explicit

Public Sub GetContent()
    Dim URL     As String: URL = "https://www.uid.admin.ch/Detail.aspx?uid_id=CHE-105.805.649"
    Dim IE      As Object: Set IE = CreateObject("InternetExplorer.Application")
    Dim Labels  As Object
    Dim Label   As Variant
    Dim Values  As Variant: ReDim Values(0 To 1, 0 To 5000)
    Dim i       As Long

    With IE
        .Navigate URL
        .Visible = False

        'Load the page
        Do Until IE.busy = False And IE.readystate = 4
            DoEvents
        Loop

        'Find all labels in the table
        Set Labels = IE.document.getElementByID("cphContent_pnlDetails").getElementsByTagName("label")

        'Iterate the labels, then find the divs relative to these
        For Each Label In Labels
            Values(0, i) = Label.InnerText
            Values(1, i) = Label.NextSibling.Children(0).InnerText
            i = i + 1
        Next

    End With

    'Dump the values to Excel
    ReDim Preserve Values(0 To 1, 0 To i - 1)
    ThisWorkbook.Sheets(1).Range("A1:B" & i) = WorksheetFunction.Transpose(Values)

    'Close IE
    IE.Quit
End Sub

Upvotes: 2

Related Questions