Simon_A_Drax
Simon_A_Drax

Reputation: 5

Web Scraping - COVID19 incidents

I have some questions reagrding an Excel VBA program that I want to build.

Basically it's pretty easy. I want to access the following website https://coronavirus.jhu.edu/map.html and extract the Confirmed Cases by Country/Region/Sovereignty (it's the table on the very left of the dashborad) and paste the values in excel.

I know all the basic stuff on how to setup an internetexplorer instance and scraping the page by tags, classes, ids etc. But I think in this sceanrio I cannot use the basic things. I guess it's pretty tricky actually. The information I am looking for is within some tags. But I cannot get their textcontent when I use the getelementsbytagname("strong") approach.

Could someone help me in this case?

I am grateful for any hints, advices and solutions.

Below you'll find the start of my code.

Best Simon

 Sub test()
    Dim ie As InternetExplorer
    Dim html As HTMLDocument
    Dim i As Integer
    Dim obj_coll As IHTMLElementCollection
    Dim obj As HTMLObjectElement


    Set ie = New InternetExplorer
    ie.Visible = False


    ie.navigate "https://coronavirus.jhu.edu/map.html"

    Do Until ie.readyState = READYSTATE_COMPLETE
    DoEvents
    Loop

    Debug.Print "Successfully connected with host"
    Set html = ie.document

    Set obj_coll = html.getElementsByTagName("strong")

    For Each obj In obj_coll

    Debug.Print obj.innerText


    Next obj

    ie.Quit
    End Sub

Upvotes: 0

Views: 250

Answers (1)

QHarr
QHarr

Reputation: 84465

You can use the iframe url direct to navigate to. You then need a timed wait to ensure the data has loaded within that iframe. I would then collect nodeLists via faster css selectors. As the nodeLists (one for figures and the other for locations) are the same length you will only need a single loop to index into both lists to get rows of data.

Option Explicit

Public Sub GetCovidFigures()

    Dim ie As SHDocVw.InternetExplorer

    Set ie = New SHDocVw.InternetExplorer

    Dim t As Date
    Const MAX_WAIT_SEC As Long = 30

    With ie
        .Visible = True
        .Navigate2 "https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6"

        Do
            DoEvents
        Loop While .Busy Or .readyState <> READYSTATE_COMPLETE

        t = Timer
        Do
            If Timer - t > MAX_WAIT_SEC Then Exit Sub
        Loop While .document.querySelectorAll(".feature-list strong").Length = 0

        Dim figures As Object, location As Object, results(), i As Long

        Set figures = .document.querySelectorAll("h5 strong")
        Set location = .document.querySelectorAll("h5 span:last-child")

        ReDim results(1 To figures.Length, 1 To 2)

        For i = 0 To figures.Length - 1
            results(i + 1, 1) = figures.item(i).innerText
            results(i + 1, 2) = location.item(i).innerText
        Next

        .Quit
    End With

    ActiveSheet.Cells(1, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Consider how frequently you want this. There are large numbers of APIs popping up to supply this data which you could instead issue faster xhr requests to. Additionally, you could simply take the source data in csv form from github here. *Files after Feb 1 (UTC): once a day around 23:59 (UTC). There is a rest API visible in dev tools network tab that is frequently supplying new data in json format which is used to update the page. That can be accessed via Python + requests or R + httr modules for example. I suspect this endpoint is not intended to be hit so look for public APIs.

Upvotes: 2

Related Questions