mchac
mchac

Reputation: 347

Extracting XML attribute using VBA

I'm not a developer and have very limited XML knowledge but for what I've learned the past 3-4 days researching on the web. So apologies in advance for the basic level of this question. I'm trying to wrap up this one time task.

I have some VBA Excel knowledge and currently I'm trying to use VBA to extract the SIC code attribute from a given company's page on the SEC filing website. As an example, this is the site for Walmart

http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000104169&owner=exclude&count=40&hidefilings=0

In the blue bar at the top you can see 'SIC: 5331' it's the 5331 I'm trying to return to a VBA variable so I can populate a spreadsheet. When I right click in IE and clich View Source the part of the page that is relevant reads in XML as:

<div id="contentDiv">
  <!-- START FILER DIV -->
  <div style="margin: 15px 0 10px 0; padding: 3px; overflow: hidden; background-color: #BCD6F8;">
    <div class="mailer">Mailing Address
      <span class="mailerAddress">702 SOUTHWEST 8TH STREET</span>
      <span class="mailerAddress"> BENTONVILLE AR 72716         </span>
    </div>
    <div class="mailer">Business Address
      <span class="mailerAddress">702 SOUTHWEST 8TH ST</span>
      <span class="mailerAddress">BENTONVILLE AR 72716         </span>
      <span class="mailerAddress">5012734000</span>
    </div>
    <div class="companyInfo">
      <span class="companyName">WAL MART STORES INC <acronym title="Central Index Key">CIK</acronym>#: <a href="/cgi-bin/browse-edgar?action=getcompany&amp;CIK=0000104169&amp;owner=exclude&amp;count=40">0000104169 (see all company filings)</a></span>
      <p class="identInfo"><acronym title="Standard Industrial Code">SIC</acronym>: <a href="/cgi-bin/browse-edgar?action=getcompany&amp;SIC=5331&amp;owner=exclude&amp;count=40">5331</a> - RETAIL-VARIETY STORES<br />State location: <a href="/cgi-bin/browse-edgar?action=getcompany&amp;State=AR&amp;owner=exclude&amp;count=40">AR</a> | State of Inc.: <strong>DE</strong> | Fiscal Year End: 0131<br />(Assistant Director Office: 2)<br />Get <a href="/cgi-bin/own-disp?action=getissuer&amp;CIK=0000104169"><b>insider transactions</b></a> for this <b> issuer</b>.
        <br />Get <a href="/cgi-bin/own-disp?action=getowner&amp;CIK=0000104169"><b>insider transactions</b></a> for this <b>reporting owner</b>.
      </p>
    </div>
  </div>
</div>

In trying to understand how VBA might be used to extract the SIC, I found the following post on your site:

Query and parse xml attribute value into XLS using VBA

I tried to apply barrowc's answer by copy/paste into an Excel module and inserted the path to the Wal Mart filings however when I step through I get the Debug.Print "*****" but I'm not getting anything for n.Text.

Sub test4()
    Dim d As MSXML2.DOMDocument60
    Dim i As IXMLDOMNodeList
    Dim n As IXMLDOMNode

    Set d = New MSXML2.DOMDocument60
    d.async = False
    d.Load ("http://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=886475&filenum=&State=&Country=&SIC=&owner=exclude&Find=Find+Companies&action=getcompany")

    Debug.Print "*****"
    Set i = d.SelectNodes("//div[@id='contentDiv']")
    For Each n In i
        Debug.Print n.Text
    Next n
    Debug.Print "*****"

    Set d = Nothing
End Sub

I've tried various strings in d.SelectNodes(), but I don't know enough about this topic to understand where I'm going wrong. So either a comment on my syntax or a pointer to a resource would be enormously helpful.

Upvotes: 1

Views: 3407

Answers (2)

mchac
mchac

Reputation: 347

Thanks again mwolfe. I've posted my code below but what you provided is much more elegant. I know that the SICs are only 4 digits so I was lazy and made that an assumption in the code and that could throw up errors in the future. You can see how I did it in the commented out part.

Sub GetSICs()
    Application.ScreenUpdating = False

    Dim AWBN As String
    Dim ASN As String
    Dim CIK As String
    Dim NUM_FILES_TO_GET As Long
    Dim COUNTER As Long
    Dim SICTagPos As Integer
    Dim SIC As String

    Set IEbrowser = CreateObject("InternetExplorer.application")
    IEbrowser.Visible = False
    AWBN = ActiveWorkbook.Name
    ASN = ActiveSheet.Name
    Workbooks(AWBN).Sheets(ASN).Range("A1").Select
    ActiveCell.Offset(0, 11) = "SIC"
    NUM_FILES_TO_GET = Application.WorksheetFunction.CountA(Range("A:A"))
    For COUNTER = 1 To 3 'NUM_FILES_TO_GET
        Application.StatusBar = "Counter = " & COUNTER
        'SICTagPos = 0
        CIK = ActiveCell.Offset(COUNTER, 2)
        IEbrowser.Navigate URL:="http://www.sec.gov/edgar/searchedgar/companysearch.html"
        Do
            DoEvents
        Loop Until IEbrowser.readyState = 4
        Set frm = IEbrowser.Document.forms(0)
        frm("CIK").Value = CIK
        frm.submit
        While IEbrowser.Busy Or IEbrowser.readyState <> 4: DoEvents: Wend
        SIC = ExtractSIC(IEbrowser.Document.body.innerhtml)
        'SICTagPos = InStr(1, IEbrowser.Document.body.innerhtml, "SIC=")
        'SIC = Right(Left(IEbrowser.Document.body.innerhtml, SICTagPos + 7), 4)
        ActiveCell.Offset(COUNTER, 11).NumberFormat = "@"
        ActiveCell.Offset(COUNTER, 11) = SIC

    Next

    Application.StatusBar = False
    Application.ScreenUpdating = True

End Sub


Function ExtractSIC(SourceHtml As String) As String
    Const PrefixChars As String = "&amp;SIC="
    Const SuffixChars As String = "&"
    Dim StartPos As Long, EndPos As Long
    StartPos = InStr(SourceHtml, PrefixChars)
    If StartPos = 0 Then Exit Function

    StartPos = StartPos + Len(PrefixChars)
    EndPos = InStr(StartPos, SourceHtml, SuffixChars) - 1
    ExtractSIC = Mid(SourceHtml, StartPos, EndPos - StartPos + 1)
End Function

Upvotes: 0

mwolfe02
mwolfe02

Reputation: 24207

If you are interested in just the SIC, it is not worth your time to try to parse the entire DOM structure. Instead, identify a unique set of characters, search for that and then extract the SIC from there.

The following function does just that. You just need to pass it the full HTML source of the page and it will return the SIC:

Function ExtractSIC(SourceHtml As String) As String
    Const PrefixChars As String = "&amp;SIC="
    Const SuffixChars As String = "&"
    Dim StartPos As Long, EndPos As Long
    StartPos = InStr(SourceHtml, PrefixChars)
    If StartPos = 0 Then Exit Function

    StartPos = StartPos + Len(PrefixChars)
    EndPos = InStr(StartPos, SourceHtml, SuffixChars) - 1
    ExtractSIC = Mid(SourceHtml, StartPos, EndPos - StartPos + 1)
End Function

Upvotes: 1

Related Questions