Juuun
Juuun

Reputation: 13

VBA web scraping contents without class name or ID

I would like to scrape Dividend Future Prices from HKEX.

Here's the URL of this site : http://www.hkex.com.hk/Market-Data/Futures-and-Options-Prices/Equity-Index/HSCEI-Dividend-Futures?sc_lang=en#&product=DHH

I wanted to scrape Prev.Day settlement price of the "Dec-19 Contract" via VBA. However, it doesn't have any class name or id, so I have no idea how to access the information.

<tr>
  <td>Dec-19</td>
  <td>-</td>
  <td>-</td>
  <td>413.78</td>
  <td>
    -
    <br>
    -
  </td>
  <td>-</td>
  <td>
    -
    <br>
    -
  </td>
    <td>-<td>
    <td>17,330</td>
  </tr>

How can I scrape this via VBA?

Upvotes: 1

Views: 2497

Answers (3)

QHarr
QHarr

Reputation: 84465

You could also simply use a CSS selector and no loop:

html.querySelectorAll("td:nth-child(4)")(1).innerText

This method is fragile. If the style on the page changes this may break.


CSS selector:

If you observe the relevant part of the page (showing first contract year with headers for context and with chart between contract years removed):

Contract info

The associated HTML for contract year 2019 is:

HTML

Prev.Day Settlement Price is the 4th td within this i.e. CSS selector td:nth-child(4).

This pattern is repeated for all contract years so you can return a nodeList of all matches to this (i.e. every td:nth-child(4) with the .querySelectorAll method).

Year 2019 is at index position 1; this is the second element in a 0 based indexed nodeList, so you access with .querySelectorAll("td:nth-child(4)")(1).


CSS query result - first few results:

CSS query

Upvotes: 1

SIM
SIM

Reputation: 22440

It's really the hell of an automation to find specific item with no remarkable flag attached to it. However, I've written this script without hardcoding index to the elements. Give this a shot and get your desired values:

Sub Hkex_Data()

    Dim IE As New InternetExplorer, html As HTMLDocument
    Dim posts As Object

    With IE
        .Visible = False
        .navigate "http://www.hkex.com.hk/Market-Data/Futures-and-Options-Prices/Equity-Index/HSCEI-Dividend-Futures?sc_lang=en#&product=DHH"
        Do Until .readyState = READYSTATE_COMPLETE: Loop
        Set html = .document
    End With
    Application.Wait (Now + TimeValue("0:00:05"))

    For Each posts In html.getElementsByClassName("hsirowcon")
        Row = Row + 1: Cells(Row, 1) = posts.NextSibling.NextSibling.FirstChild.innerText
        Cells(Row, 2) = posts.NextSibling.NextSibling.LastChild.innerText
    Next posts

    IE.Quit
End Sub

Result:

19-Dec  17,330

Reference to add to the library:

Microsoft internet controls
Microsoft Html Object Library

Upvotes: 1

MarcinSzaleniec
MarcinSzaleniec

Reputation: 2256

Use getElementsByTagName. Identify your and then go through each row and each td in rows. Something like that.

Dim objTR As IHTMLElement
Dim objTD As IHTMLElement
Dim objTable As IHTMLElement

For Each objTR In objTable.getElementsByTagName("tr")
    For Each objTD In objTR
        'do something with objtd.innerText
    Next objTD
Next objTR

or you can declare your variables as Object if you prefer late binding.

Upvotes: 1

Related Questions