YasserKhalil

Reputation: 9538

Scrape a nested table from a local HTML file using Selenium

In the thread "Scraping table from local HTML with unicode characters", QHarr helped me scrape a table from a local HTML file. I have another HTML file at this Link

I used the same code and only edited the values of the variables 'startTableNumber', 'endTableNumber' and 'numColumns':

Public Sub Test()
Dim fStream  As ADODB.Stream, html As HTMLDocument
Set html = New HTMLDocument
Set fStream = New ADODB.Stream
With fStream
    .Charset = "UTF-8"
    .Open
    .LoadFromFile "C:\Users\Future\Desktop\Sample 2.html"
    html.body.innerHTML = .ReadText
    .Close
End With

Dim hTables As Object, startTableNumber As Long, i As Long, r As Long, c As Long
Dim counter As Long, endTableNumber As Long, numColumns As Long

startTableNumber = 91
endTableNumber = 509
numColumns = 14

Set hTables = html.getElementsByTagName("table")
r = 2: c = 1

For i = startTableNumber To endTableNumber Step 2
    counter = counter + 1
    If counter = 10 Then
        c = 1: r = r + 1: counter = 1
    End If
    Cells(r, c) = hTables(i).innerText
    c = c + 1
Next

End Sub

But the table data comes out scattered. Furthermore, I would like a flexible way for the code to work out those variables without my assigning them manually. I hope to find a solution using Selenium. I also hope not to receive negative rep; I have done my best to clarify the issue. Regards

Upvotes: 1

Views: 193

Answers (1)

QHarr

Reputation: 84465

So, as I said in my comments, you need to study how the data appears in the later table tags and perform a mapping to get the correct ordering. The following writes out the table. As I also mentioned, this is not robust; at most the methodology may be transferable to other documents.

In your case, with Selenium, you wouldn't be reading from a file but would use:
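The mapping idea can be sketched in isolation. This is a minimal illustration, not the actual document: the sub name MappingDemo, the three sample values and the mapping Array(1, 0, 3) are all made up. Each value is read in the order it appears in the HTML and is written to the zero-based output column given by the mapping.

```vba
Option Explicit

' Hypothetical demo: reorder values that arrive in document order
' into the column order wanted on the sheet.
Public Sub MappingDemo()
    Dim src As Variant, mappings As Variant
    Dim arr(3) As Variant, j As Long

    ' Made-up values, in the order they appear in the HTML
    src = Array("Type A", "Notes text", "Code 7")
    ' mappings(j) = zero-based output column for src(j)
    mappings = Array(1, 0, 3)

    For j = LBound(src) To UBound(src)
        arr(mappings(j)) = src(j)
    Next j

    ' Output columns that no mapping points to stay Empty
    For j = LBound(arr) To UBound(arr)
        Debug.Print j, arr(j)
    Next j
End Sub
```

Working out the real mapping still means inspecting the HTML by hand to see which field each table holds.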

Set tables = driver.FindElementsByCss("table[width='100%'] table:first-child")

You would then For Each over the web elements in the collection, adjusting the syntax as required, e.g. .Text instead of .innerText. There may be a few other adaptations for Selenium due to its indexing of webElements, but everything you need should be evident below.

VBA:

Option Explicit
Public Sub ParseInfo()
    Dim html As HTMLDocument, tables As Object, ws As Worksheet, i As Long
    Set ws = ThisWorkbook.Worksheets("Sheet2")
    Dim fStream  As ADODB.Stream
    Set html = New HTMLDocument
    Set fStream = New ADODB.Stream
    With fStream
        .Charset = "UTF-8"
        .Open
        .LoadFromFile "C:\Users\User\Desktop\test.html"
        html.body.innerHTML = .ReadText
        .Close
    End With

    ' Select the first-child tables nested inside the full-width tables
    Set tables = html.querySelectorAll("table[width='100%'] table:first-child")
    Dim rowCounter As Long: rowCounter = 2   ' first data row; row 1 holds the headers
    Dim mappings(), j As Long, headers(), arr(13)
    headers = Array("Notes", "Type", "Enrollment status", "Governorate of birth", "Year", "Month", "Day", "Date of Birth", "Religion", _
    "Nationality", "Student Name", "National Number", "Student Code", "M")

    ' mappings(j) is the zero-based output column for the j-th value of a record
    mappings = Array(3, 8, 9, 12, 11, 10, 2, 7, 1, 6, 5, 4, 13)
    ws.Cells(1, 1).Resize(1, UBound(headers) + 1) = headers

    ' Each record spans 26 tables; its 13 values sit in every second table
    For i = 89 To 504 Step 26
        arr(0) = vbNullString   ' the "Notes" column is left empty

        For j = 0 To 12
            arr(mappings(j)) = tables.item(2 * j + i).innerText
        Next

        ws.Cells(rowCounter, 1).Resize(1, UBound(arr) + 1) = arr
        rowCounter = rowCounter + 1
    Next
End Sub
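
For the Selenium route mentioned above, a rough sketch of the same loop might look like the following. It assumes SeleniumBasic is installed with a reference to the Selenium Type Library, that the WebElements collection is indexed from 1 (hence 90 instead of 89), and that the page exposes the same table layout as the saved file. PAGE_URL and the sub name are placeholders, and the start and end indices are untested against a live page.

```vba
Option Explicit

' Assumption: placeholder address; replace with the page you actually load
Private Const PAGE_URL As String = "file:///C:/Users/User/Desktop/test.html"

Public Sub ParseInfoSelenium()
    Dim driver As New ChromeDriver          ' SeleniumBasic class
    Dim tables As WebElements, ws As Worksheet
    Dim mappings As Variant, arr(13) As Variant
    Dim i As Long, j As Long, rowCounter As Long

    Set ws = ThisWorkbook.Worksheets("Sheet2")
    mappings = Array(3, 8, 9, 12, 11, 10, 2, 7, 1, 6, 5, 4, 13)

    driver.Get PAGE_URL
    Set tables = driver.FindElementsByCss("table[width='100%'] table:first-child")

    rowCounter = 2
    ' Assumption: 1-based collection, so every index shifts by +1
    For i = 90 To 505 Step 26
        ' Stop if a full record would run past the end of the collection
        If i + 24 > tables.Count Then Exit For
        arr(0) = vbNullString
        For j = 0 To 12
            arr(mappings(j)) = tables.Item(2 * j + i).Text
        Next j
        ws.Cells(rowCounter, 1).Resize(1, UBound(arr) + 1).Value = arr
        rowCounter = rowCounter + 1
    Next i

    driver.Quit
End Sub
```

Note that .Text returns only the text Selenium considers visible, so the output may differ from .innerText where elements are hidden.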

Upvotes: 1
