Aidan O'Farrell
Aidan O'Farrell

Reputation: 89

Webscraping with VBA morningstar financial

I'm trying to scrape the inside ownership from Morningstar at this url: http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US

This is the code I'm using:

Sub test()

    Dim appIE As Object

    Set appIE = CreateObject("InternetExplorer.Application")
    With appIE
        .Navigate "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US"
        .Visible = True
    End With
    While appIE.Busy
        DoEvents
    Wend
    Set allRowOfData = appIE.Document.getElementById("currentInsiderVal")
    Debug.Print allRowOfData
    Dim myValue As String: myValue = allRowOfData.Cells(0).innerHTML
    appIE.Quit
    Set appIE = Nothing

    Range("A30").Value = myValue

End Sub

I get run-time error 13 at line

Set allRowOfData = appIE.Document.getElementById("currentInsiderVal")

but I can't see any mismatch. What is going on?

Upvotes: 1

Views: 1013

Answers (2)

omegastripes
omegastripes

Reputation: 12612

You can just do it with XHR and RegEx instead of cumbersome IE:

Sub Test()
    Dim sContent
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US", False
        .Send
        sContent = .ResponseText
    End With
    With CreateObject("VBScript.RegExp")
        .Pattern = ",""currInsiderVal"":(.*?),"
        Range("A30").Value = .Execute(sContent).Item(0).SubMatches(0)
    End With
End Sub

Here is the description how the code works:

First of all MSXML2.XMLHTTP ActiveX instance is created. GET request opened with target URL in synchronous mode (execution interrupts until response received).

Then VBScript.RegExp is created. By default .IgnoreCase, .Global and .MultiLine properties are False. The pattern is ,"currInsiderVal":(.*?),, where (.*?) is a capturing group, . means any character, .* - zero or more characters, .*? - as few as possible characters (lazy matching). Other characters in pattern to be found as is. .Execute method returns a collection of matches, there is only one match object in it since .Global is False. This match object has a collection of submatches, there is only one submatch in it since the pattern contains the only capturing group.
There are some helpful MSDN articles on regex:
Microsoft Beefs Up VBScript with Regular Expressions
Introduction to Regular Expressions

Here is the description how I created the code:

First I found an element containing the target value on the webpage DOM using browser:

target value

The corresponding node is:

<td align="right" id="currrentInsiderVal">143.51</td>

Then I made XHR and found this node in the response HTML, but it didn't contain the value (you can find response in the browser developer tools on network tab after you refresh the page):

<td align="right" id="currrentInsiderVal">
</td>

Such behavior is typical for DHTML. Dynamic HTML content is generated by scripts after the webpage loaded, either after retrieving a data from web via XHR or just processing already loaded withing webpage data. Then I just searched for the value 143.51 in the response, the snippet ,"currInsiderVal":143.51, located within JS function:

            fundsArr = {"fundTotalHistVal":132.61,"mutualFunds":[[1,89,"#a71620"],[2,145,"#a71620"],[3,152,"#a71620"],[4,198,"#a71620"],[5,155,"#a71620"],[6,146,"#a71620"],[7,146,"#a71620"],[8,132,"#a71620"]],"insiderHisMaxVal":3.535,"institutions":[[1,273,"#283862"],[2,318,"#283862"],[3,351,"#283862"],[4,369,"#283862"],[5,311,"#283862"],[6,298,"#283862"],[7,274,"#283862"],[8,263,"#283862"]],"currFundData":[2,2202,"#a6001d"],"currInstData":[1,4370,"#283864"],"instHistMaxVal":369,"insiders":[[5,0.042,"#ff6c21"],[6,0.057,"#ff6c21"],[7,0.057,"#ff6c21"],[8,3.535,"#ff6c21"],[5,0],[6,0],[7,0],[8,0]],"currMax":4370,"histLineQuars":[[1,"Q2"],[2,"Q3"],[3,"Q4"],[4,"Q1<br>2015"],[5,"Q2"],[6,"Q3"],[7,"Q4"],[8,"Q1<br>2016"]],"fundHisMaxVal":198,"currInsiderData":[3,143,"#ff6900"],"currFundVal":2202.85,"quarters":[[1,"Q2"],[2,""],[3,""],[4,"Q1<br>2015"],[5,""],[6,""],[7,""],[8,"Q1<br>2016"]],"insiderTotalHistVal":3.54,"currInstVal":4370.46,"currInsiderVal":143.51,"use10YearData":"false","instTotalHistVal":263.74,"maxValue":369};

So the regex pattern created based on that it should find the snippet ,"currInsiderVal":<some text>, where <some text> is our target value.

Upvotes: 1

Dave
Dave

Reputation: 4356

Had a look on the site and the element you are trying to retrieve has a typo in it; instead of currentInsiderVal try using currrentInsiderVal and you should retrieve the data correctly.

Probably worth considering some error trapping to catch stuff like this for any other fields you retrieve?

After your comment I took a closer look. Your issue seemed like it was trying to trap the id of the individual cell rather than navigating down the object tree. I've modified the code to retrieve the row of the table you are after and then set myValue to be the correct cell within that row. Seemed to be working when I tried it out. Give this a shot?

Sub test()

Dim appIE As Object
Set appIE = CreateObject("internetexplorer.application")


With appIE
    .Navigate "http://investors.morningstar.com/ownership/shareholders-overview.html?t=TWTR&region=usa&culture=en-US"
    .Visible = True
End With

While appIE.Busy
    DoEvents
Wend

Set allRowOfData = appIE.Document.getelementbyID("tableTest").getElementsByTagName("tbody")(0).getElementsByTagName("tr")(5)
myValue = allRowOfData.Cells(2).innerHTML

appIE.Quit
Set appIE = Nothing
Range("A30").Value = myValue
End Sub

Upvotes: 0

Related Questions