Dmitrij Holkin

Reputation: 2055

Right syntax in HTML scraping

I have HTML that changes dynamically:

<tbody>
' ------------------- Block 1 ----------------------
   <tr class="table-row">
      <td class="cell">
         <div>18/4/2018</div>
      </td>
      <td class="cell">
         <div>
            <form id="idc" method="post" action=""> ' id is dynamic, so I can't use it
               <div style=""><input type="hidden" name="idc_hf_0" id="idc_hf_0" /></div> ' id and name are dynamic, so I can't use them
               Download all invoice documents as ZIP-file
               <span>
               <a class="icon zipdownload" title="Download all invoice documents as ZIP-file" href=""></a>
               </span>
               <span class="has-explanation">
               <a class="helper" href="javascript:;" title="The zip-file contains only PDF files of Tax/Fee statements and the Fleet Invoice with all annexes if available.">
               <span class="icon question" id="table-header-explanation"></span>
               </a>
               </span>
            </form>
         </div>
      </td>
      <td class="cell">
         <div>
            <a class="" title="View &gt;&gt;" href="">View &gt;&gt;</a>
         </div>
      </td>
   </tr>
 ' ################### Block1 END #######################
 
 ' ------------------- Block 2 ----------------------
   <tr class="table-row">
      <td class="cell">
         <div>13/4/2018</div> ' need this
      </td>
      <td class="cell">
         <div>
            <form id="idd" method="post" action="">
               <div style=""><input type="hidden" name="idd_hf_0" id="idd_hf_0" /></div>
               <div>
                  <span>Collective Payment Order</span> (<span>2018-500421707</span>)
                  <span>
                  <span class="invisible"> | </span><span>
                  <a class="Download" title="Download" href="">English</a>
                  </span>
                  </span>
               </div>
               <div>
                  <span>Tax/Fee CSV list</span> <span>
                  <a class="icon csv" title="Download" href=""></a>  ' need this  HREF1
                  </span>
               </div>
               <div>
                  <span>Detailed Trip CSV list</span> <span>
                  <a class="icon csv" title="Download" href=""></a> ' need this HREF2
                  </span>
               </div>
               Download all invoice documents as ZIP-file
               <span>
               <a class="icon zipdownload" title="Download all invoice documents as ZIP-file" href=""></a>
               </span>
               <span class="has-explanation">
               <a class="helper" href="javascript:;" title="The zip-file contains only PDF files of Tax/Fee statements and the Fleet Invoice with all annexes if available.">
               <span class="icon question" id="table-header-explanation"></span>
               </a>
               </span>
            </form>
         </div>
      </td>
      <td class="cell">
         <div>
            <a class="" title="View &gt;&gt;" href="">View &gt;&gt;</a>
         </div>
      </td>
   </tr>
  ' ################### Block2 END #######################
  
</tbody>

So there are two kinds of block, and they can appear any number of times in any order. For example, the structure can be:

Block1
Block1
Block2
Block1
Block2
Block2
Block2
Block1

I need to get the following from these blocks:

  1. Count of Block 2
  2. Date of each Block 2
  3. HREF1 from class="icon csv"
  4. HREF2 from class="icon csv"

To differentiate between Block 1 and Block 2: Block 1 has neither class="icon csv" nor <span>Tax/Fee CSV list</span>.

I'm confused about how to use the getElement* methods. I tried:

Set IeDoc = IeApp.Document
    With IeDoc
        Set IeTbody = .getElementsByTagName("tbody").getElementsByClassName("table-row")
        d = IeTbody.legth
        For Each stEl In IeTbody
            
        Next stEl

    End With

But I get the error "Object does not support this property or method". Would it be better to use querySelector? How do I get the links?

Logically, it should be something like:

Set IeDoc = IeApp.Document
    With IeDoc
        Set Blocks = .getElementsByTagName("tbody")

    For Each block In Blocks
        Set hasClass = .getElementsByClassName("table-row").getElementsByClassName("cell")(1).getElementsByClassName("icon csv")
        if not hasClass is nothing then
            b.Date = Blocks(block).getElementsByClassName("table-row").getElementsByClassName("cell")(0).getElementsByTagName("div")(0).innerText()
            b.Href1 = Blocks(block).getElementsByClassName("table-row").getElementsByClassName("cell")(1).getElementsByClassName("icon csv")(0)
            b.Href2 = Blocks(block).getElementsByClassName("table-row").getElementsByClassName("cell")(1).getElementsByClassName("icon csv")(1)
        end if
    Next block

End With

Upvotes: 0

Views: 63

Answers (1)

QHarr

Reputation: 84465

This isn't very robust, but it was an experiment with regex on the HTML you gave. A lookbehind would help pull the date in with the regex split, but I couldn't work that out at present. For now I have adapted a regex split function by @FlorentB:

Public Matches As Object
' Or add in Tools > References > VBScript Reg Exp for early binding
Public Sub testing()
    Dim str As String, countOfBlock2   As Long, arr() As String, i As Long
    str = Range("A1") 'I am reading in from sheet but this would be your response text
    arr = SplitRe(str, "\<div>[\d]+[\/-][\d]+[\/-][\d]+\<\/div>") 'look behind would help

    For i = LBound(arr) To UBound(arr)

        If InStr(1, arr(i), "class=""icon csv""") > 0 Then
           countOfBlock2 = countOfBlock2 + 1 ' "Block 2"
           Debug.Print Replace(Replace(Matches(i - 1), "<div>", ""), "</div>", "") 'dates from Block 2
           Debug.Print Split(Split(arr(i), """icon csv"" title=""Download"" href=")(1), "></a>")(0)
           Debug.Print Split(Split(arr(i), """icon csv"" title=""Download"" href=")(2), "></a>")(0)
        End If

   Next i

   Debug.Print "count of block2 = " & countOfBlock2

End Sub

' https://stackoverflow.com/questions/28107005/splitting-string-in-vba-using-regex
Public Function SplitRe(Text As String, Pattern As String, Optional IgnoreCase As Boolean) As String()
    Static re As Object

    If re Is Nothing Then
        Set re = CreateObject("VBScript.RegExp")
        re.Global = True
        re.MultiLine = True
    End If

    re.IgnoreCase = IgnoreCase
    re.Pattern = Pattern
    SplitRe = Strings.Split(re.Replace(Text, ChrW(-1)), ChrW(-1))

    Set Matches = re.Execute(Text)

End Function

Output:

[Image: Output]
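
As a DOM-based alternative to the regex split, here is a minimal sketch that walks the rows directly. The error in the question most likely comes from calling getElementsByClassName on the collection returned by getElementsByTagName("tbody"): the collection itself has no such method, so you have to index into it first (and `.legth` should be `.Length`). The function name CountBlock2Rows is hypothetical, and the sketch assumes the document is fully loaded and that every Block 2 row contains at least two `a` elements whose class is exactly "icon csv". It uses only getElementsByTagName and className, so it should not depend on the document mode.

```vba
' Hypothetical helper: counts "Block 2" rows in a loaded HTML document and
' prints the date and the two CSV links of each one to the Immediate window.
Public Function CountBlock2Rows(ByVal doc As Object) As Long
    Dim rows As Object, anchors As Object
    Dim csvLinks As Collection
    Dim i As Long, j As Long

    ' getElementsByTagName returns a collection; use Item(i) to reach an element
    Set rows = doc.getElementsByTagName("tr")

    For i = 0 To rows.Length - 1
        If rows.Item(i).className = "table-row" Then
            ' Collect the href of every <a class="icon csv"> inside this row
            Set csvLinks = New Collection
            Set anchors = rows.Item(i).getElementsByTagName("a")
            For j = 0 To anchors.Length - 1
                If anchors.Item(j).className = "icon csv" Then
                    csvLinks.Add anchors.Item(j).href
                End If
            Next j

            ' Assumption: only Block 2 rows carry the two "icon csv" links
            If csvLinks.Count >= 2 Then
                CountBlock2Rows = CountBlock2Rows + 1
                ' The first <div> in the row holds the date
                Debug.Print rows.Item(i).getElementsByTagName("div").Item(0).innerText
                Debug.Print csvLinks(1) ' HREF1
                Debug.Print csvLinks(2) ' HREF2
            End If
        End If
    Next i
End Function
```

With the question's variables it could be called as `Debug.Print "count of block2 = " & CountBlock2Rows(IeApp.Document)`.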

Upvotes: 1
