Arohi Sharma

Reputation: 13

How to get innerText of a tag in VBA excluding text from nested tags?

I am using VBA for web scraping. The HTML structure and my VBA code are shown below.

When I run the code, innerText returns the full text of the div: "ETA : 2020-08-26 (Reference only, the date will be updated according to shipments)."

I want to scrape only the date, 2020-08-26, without the text of the nested span tags.

 <div style="font-size: 14px;">
     <span class="label" style="font-weight: bolder; font-size: 13px;">ETA : </span>
     <br>
     2020-08-26 
    <span style="color: red; font-size: 12px;">(Reference only, the date will be updated according to 
     shipments).</span>
</div>

VBA code:

Dim ie As New InternetExplorer
Dim doc As HTMLDocument
Dim elems As Object

ie.navigate "http://127.0.0.1/wordpress/sample-page/"
Do
    DoEvents
Loop Until ie.readyState = READYSTATE_COMPLETE

Set doc = ie.document
Set elems = doc.getElementsByTagName("div")
MsgBox elems(33).innerText

Upvotes: 1

Views: 1335

Answers (4)

Tim Williams

Reputation: 166331

Dim html, divs, d, c

Set html = CreateObject("htmlfile")
html.body.innerHTML = "<div style='font-size: 14px;'><span class='label' style='font-weight: bolder; font-size: 13px;'>ETA : </span>" & _
    "<br>2020-08-26" & _
    "<span style='color: red; font-size: 12px;'>(Reference only, the date will be updated according to shipments).</span>" & _
    "</div>"

Set divs = html.getElementsByTagName("div")
For Each d In divs
    For Each c In d.ChildNodes
        Debug.Print TypeName(c), c.nodeName, c.NodeValue
    Next c
Next d

output:

HTMLSpanElement             SPAN          Null
HTMLBRElement               BR            Null
DispHTMLDOMTextNode         #text         2020-08-26
HTMLSpanElement             SPAN          Null
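
The listing shows that the date is the only child that is a text node, so one option is to keep just the children whose nodeType is 3. The following is a minimal sketch under that assumption, using the same late-bound htmlfile document; OwnText and DemoOwnText are hypothetical names chosen for illustration, and the markup in the demo is a shortened copy of the question's HTML.

```vba
' Hypothetical helper: concatenates the values of the direct text-node
' children of el, so text inside nested tags (the two spans) is skipped.
Function OwnText(ByVal el As Object) As String
    Dim c As Object
    Dim s As String

    For Each c In el.ChildNodes
        If c.NodeType = 3 Then      ' 3 = text node
            s = s & c.NodeValue
        End If
    Next c

    ' Replace line breaks and tabs that the page source may contain
    s = Replace(s, vbCr, " ")
    s = Replace(s, vbLf, " ")
    s = Replace(s, vbTab, " ")
    OwnText = Trim$(s)
End Function

Sub DemoOwnText()
    Dim html As Object
    Set html = CreateObject("htmlfile")
    ' Shortened version of the markup from the question (an assumption)
    html.body.innerHTML = "<div><span class='label'>ETA : </span><br>2020-08-26" & _
        "<span>(Reference only).</span></div>"
    Debug.Print OwnText(html.getElementsByTagName("div")(0))
End Sub
```

In the question's code, the equivalent call would presumably be OwnText(elems(33)).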

Upvotes: 1

JMP

Reputation: 4467

This code finds the first date of the form ####-##-## in a string.

Cells.Clear
s = "ETA : 2020-08-26 (Reference only, the date will be updated according to shipments)."
' Build a masked copy of s in which every digit is replaced by "#"
ReDim a(1 To Len(s))
For i = 1 To Len(s)
    a(i) = IIf(Mid(s, i, 1) Like "#", "#", Mid(s, i, 1))
Next i
fd = "####-##-##"
Cells(1, 1) = s
aa = Join(a, "")
Cells(2, 1) = aa
' InStr returns 0 when no date is present; Mid would then raise an error
pos = InStr(aa, fd)
If pos > 0 Then
    Cells(3, 1) = Mid(s, pos, Len(fd))
    Cells(3, 1).NumberFormat = "yyyy-mm-dd"
End If

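First, it copies the string character by character into an array, replacing every digit with #. It then joins the array back into a string and uses InStr to find the pattern template fd in it. Because the masked string has the same length as the original, the position returned by InStr is also the position of the actual date in s.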
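
The same template can also be tested with the Like operator directly against each 10-character window of the string, which avoids building the masked copy. This is only a sketch of that variation, not the code above; FirstIsoDate and DATE_MASK are hypothetical names.

```vba
' Hypothetical helper: returns the first substring of s that matches
' ####-##-## (digits and hyphens only; it does not validate the date),
' or an empty string when there is no match.
Function FirstIsoDate(ByVal s As String) As String
    Const DATE_MASK As String = "####-##-##"
    Dim i As Long

    For i = 1 To Len(s) - Len(DATE_MASK) + 1
        If Mid$(s, i, Len(DATE_MASK)) Like DATE_MASK Then
            FirstIsoDate = Mid$(s, i, Len(DATE_MASK))
            Exit Function
        End If
    Next i
End Function

Sub DemoFirstIsoDate()
    Debug.Print FirstIsoDate("ETA : 2020-08-26 (Reference only, the date will be updated according to shipments).")
End Sub
```

An empty return value means no match was found, so the caller can check Len(result) > 0 before writing to a cell.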

Upvotes: 1

braX

Reputation: 11755

Once you have the string, you can use a combination of InStr, Mid, and Trim to extract the text between the colon and the opening parenthesis, which is the date:

Sub test()
  Dim sSource As String
  Dim nStart As Integer
  Dim nEnd As Integer
  Dim sResult As String
  Dim dtDate As Date
  
  sSource = "ETA : 2020-08-26 (Reference only, the date will be updated according to shipments)"
  nStart = InStr(sSource, ":")
  nEnd = InStr(sSource, "(")
  
  sResult = Trim$(Mid$(sSource, nStart + 1, nEnd - nStart - 1))
  If IsDate(sResult) Then
    dtDate = CDate(sResult)
    MsgBox "Success: " & dtDate
  Else
    MsgBox sResult & " is not a date"
  End If

End Sub

Upvotes: 1

Zwenn

Reputation: 2267

You can do that either by string manipulation or by walking a path through the DOM. Here is the solution using the DOM path.

Sub SelectFromDropdown()

  Dim url As String
  Dim browser As Object
  Dim nodeDiv As Object
  
  url = "Your URL Here"
  
  'Initialize Internet Explorer, set visibility,
  'call URL and wait until page is fully loaded
  Set browser = CreateObject("internetexplorer.application")
  browser.Visible = True
  browser.navigate url
  Do Until browser.readyState = 4: DoEvents: Loop
  
  'Instead of (0), use (33) as in your code.
  'However, I do not recommend such high indices because they
  'are fragile: one additional div tag earlier in the page shifts
  'the index and the macro stops working. This does not apply if
  'you loop through an HTML section that has been selected as a
  'container of exactly these div tags.
  Set nodeDiv = browser.document.getElementsByTagName("div")(0)
  
  'To get only the date, walk the DOM (Document Object Model) path.
  'The date is a text node, not an element, so innerText
  'does not apply to it. Read its NodeValue instead.
  MsgBox nodeDiv.FirstChild.NextSibling.NextSibling.NextSibling.NextSibling.NodeValue
End Sub
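
The comments above warn that a fixed index such as (33) is fragile, and a fixed chain of NextSibling calls has a similar weakness: the number of steps depends on whether the browser keeps whitespace-only text nodes. One possible way around both is to anchor on the label span and walk forward to the first non-blank text node. This is a rough sketch under the assumption that the label always has class "label" and starts with "ETA"; TextAfterLabel is a hypothetical helper name.

```vba
' Hypothetical helper: finds a <span class="label"> whose text starts with
' labelText and returns the first non-blank text node that follows it.
Function TextAfterLabel(ByVal doc As Object, ByVal labelText As String) As String
    Dim sp As Object
    Dim n As Object
    Dim v As String

    For Each sp In doc.getElementsByTagName("span")
        If sp.className = "label" And InStr(1, sp.innerText, labelText, vbTextCompare) = 1 Then
            Set n = sp.NextSibling
            Do While Not n Is Nothing
                If n.NodeType = 3 Then      ' 3 = text node
                    ' Drop line breaks and tabs, then surrounding spaces
                    v = Replace(Replace(Replace(n.NodeValue, vbCr, ""), vbLf, ""), vbTab, "")
                    v = Trim$(v)
                    If Len(v) > 0 Then
                        TextAfterLabel = v
                        Exit Function
                    End If
                End If
                Set n = n.NextSibling
            Loop
        End If
    Next sp
End Function
```

With the browser object from the macro above, the call would be MsgBox TextAfterLabel(browser.document, "ETA").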

Upvotes: 0
