Codo
Codo

Reputation: 271

VBA Getting to a URL Location without parsing the IE.Document with IE Automation

In the following code we are using IE Automation to get from here

Location 1

"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & Ticker & "&type=10-Q&dateb=&owner=exclude&count=20"

to a location like this

Location 2

https://www.sec.gov/Archives/edgar/data/10795/000119312514042815/bdx-20131231.xml

Is there a way to go from location 1 to location 2 without using IE Automation and finding something more reliable, secure and faster?

For reasons of completeness here is the full code we have by now; by running you will see heavy use of IE:

Option Explicit

Sub MadMule2()
    Dim IE As InternetExplorer
    Dim el
    Dim els
    Dim colDocLinks As New Collection
    Dim Ticker As String
    Dim lnk
    Dim intCounter as Integer    

    Set IE = New InternetExplorer

    IE.Visible = False

    Ticker = Worksheets("Sheet1").Range("A1").Value

    LoadPage IE, "https://www.sec.gov/cgi-bin/browse-edgar?" & _
                  "action=getcompany&CIK=" & Ticker & "&type=10-Q" & _
                  "&dateb=&owner=exclude&count=20"

    Set els = IE.document.getElementsByTagName("a")
    For Each el In els
        If Trim(el.innerText) = "Documents" Then
            colDocLinks.Add el.href
        End If
    Next el

    intCounter = 1

     For Each lnk In colDocLinks
        LoadPage IE, CStr(lnk)
        For Each el In IE.document.getElementsByTagName("a")
            If el.href Like "*[0-9].xml" Then
                ActiveWorkbook.XmlMaps.Add(el, "xbrl").Name = "xbrl Map"
            End If
        Next el
    Next lnk
End Sub

Sub LoadPage(IE As InternetExplorer, URL As String)
    IE.navigate URL
    Do While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
End Sub

ADDITIONAL

Q: Is there a way to go from location 1 to location 2 without using IE Automation and finding something more reliable, secure and faster?

Can you expand on this?

By mehow


A: Here is a comment for the code block we have received by user2140261 here:

You should look into MSXML it is much faster, secure, and reliable then IE automation.

Since the code opens Internet Explorer, parses the source page to find the href and get to the Web location needed; we wondered if there is a way to go into location 2 without the use of IE. Can it be done with MSXML as user2140261 states?

Upvotes: 1

Views: 1148

Answers (1)

QHarr
QHarr

Reputation: 84465

Here is an example using XHR

Option Explicit
Public Sub GetLinks()
    Dim ticker As String, html As New HTMLDocument, links As Object, i As Long
    ticker = [A1]                                'example is 81251
    Set html = GetHTMLDocument("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" & ticker)
    If html Is Nothing Then Exit Sub
    Set links = html.querySelectorAll("#documentsbutton")
    If links Is Nothing Then Exit Sub
    For i = 0 To links.Length - 1
        Debug.Print GetAbsoluteURL(links(i).getAttribute("href"))
    Next i
End Sub

Public Function GetAbsoluteURL(ByVal relativeURL As String) As String
    If InStr(relativeURL, "about:/") > 0 Then
        GetAbsoluteURL = Replace$(relativeURL, "about:/", "https://www.sec.gov/")
    Else
        GetAbsoluteURL = relativeURL
    End If
End Function

Public Function GetHTMLDocument(ByVal URL As String) As HTMLDocument
    Dim sResponse As String
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .send
        If .Status <> 200 Then
            Exit Function
        Else
            sResponse = StrConv(.responseBody, vbUnicode)
        End If
    End With
    Set GetHTMLDocument = New HTMLDocument
    GetHTMLDocument.body.innerHTML = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
End Function

Sample ticker is 81251


Sample output:

Sample output


CSS selectors and .querySelectorAll

  You can target the documents buttons with a CSS selector of "#documentsbutton". Now, whilst # means id, and id is usually unique, for this particular website it seems that the same id is used to identify all elements within a table.

The CSS selector query returns the following (sample) from the page:

CSS query


Applying the CSS selector in VBA:

  As more than one element is matched, the .querySelectorAll method of document is used to return a nodeList of items matched by the css selector:

html.querySelectorAll("#documentsbutton")

  We can traverse the nodeList, along its .Length, and access individual document button elements by index (position along nodeList; starting at 0):

For i = 0 To links.Length - 1

  As you want the hyperlinked info you can simply pull the href attribute from each matched element:

links(i).getAttribute("href")

  This returns a relative path so we use a tiny helper function, GetAbsoluteURL, to get the absolute path.


Closing notes:

  Whilst not completely adhering to the Single Responsibility Principle this does show some of the benefits of re-factoring your code. You can then re-use GetHTMLDocument function to handle the new URLs returned by GetAbsoluteURL

Upvotes: 1

Related Questions