SIM

Reputation: 22440

Unable to parse all the links from a webpage

I can't figure out how to get all of the company links from the page used in my code. When I run my script, I get only 20 links. The page uses lazy loading, which is why I can't get the rest of them. Any input on this will be highly appreciated. What I've tried so far:

Sub Company_links()
    Const lnk = "http://fortune.com"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim topic As Object, x As Long

    With http
        .Open "GET", "http://fortune.com/fortune500/list/", False
        .send
        html.body.innerHTML = .responseText
    End With

    'Only the first batch of companies is in the raw response;
    'the rest are lazy-loaded by JavaScript
    For Each topic In html.getElementsByClassName("small-12 column row")
        x = x + 1
        With topic.getElementsByTagName("a")
            If .Length Then Cells(x, 1) = lnk & Split(.Item(0).href, "about:")(1)
        End With
    Next topic

    Set html = Nothing: Set topic = Nothing
End Sub

Upvotes: 0

Views: 113

Answers (3)

ASH

Reputation: 20302

I would do it like this.

Option Explicit

Sub Sample()
    Dim ie As Object
    Dim links As Variant, lnk As Variant
    Dim rowcount As Long

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.navigate "http://fortune.com"

    'Wait for the site to fully load
    Do While ie.Busy Or ie.readyState <> 4
        DoEvents
    Loop

    Set links = ie.document.getElementsByTagName("a")

    rowcount = 1

    With Sheets("Sheet1")
        For Each lnk In links
            .Range("A" & rowcount) = lnk.innerText
            rowcount = rowcount + 1
        Next
    End With

    ie.Quit
End Sub

Upvotes: 0

Tehscript

Reputation: 2556

Run the following code in a new workbook. It will output the results to Sheet1 regardless of whether the cells there are empty, so be careful if you already have data on that sheet. You can change that part of the code later as you like.

First of all, you need to add references to Microsoft HTML Object Library and Microsoft Internet Controls from Tools -> References in the VBA Editor. Then run the following code, sit back and relax until you see the "All Done!" message:

Sub Company_links()
    Dim i As Long
    Dim aIE As InternetExplorer
    Dim Rank As IHTMLElement, Company As IHTMLElement, Revenues As IHTMLElement

    Set aIE = New InternetExplorer
    With aIE
        .navigate "http://fortune.com/fortune500/list/"
        .Visible = True
    End With

    Do While (aIE.Busy Or aIE.ReadyState <> READYSTATE_COMPLETE)
        DoEvents
    Loop

    'Keep pressing End to scroll the page until the 1000th
    'company-rank element has been lazy-loaded
    For i = 1 To 50
        On Error Resume Next
        Set Rank = aIE.document.getElementsByClassName("column small-2 company-rank")(999)
        On Error GoTo 0
        If Rank Is Nothing Then
            GoTo Skip
        End If
        Exit For
Skip:
        SendKeys "{END}"
        Application.Wait (Now() + TimeValue("00:00:05"))
    Next i

    With Sheet1
        .Range("A1") = "RANK"
        .Range("B1") = "COMPANY"
        .Range("C1") = "REVENUE"

        For i = 0 To 999
            Set Rank = aIE.document.getElementsByClassName("column small-2 company-rank")(i)
            Set Company = aIE.document.getElementsByClassName("column small-5 company-title")(i)
            Set Revenues = aIE.document.getElementsByClassName("column small-5 company-revenue")(i)
            .Range("A" & i + 2) = Rank.innerText
            .Range("B" & i + 2) = Company.innerText
            .Range("C" & i + 2) = Revenues.innerText
        Next i
    End With

    'Close the IE window and clean up
    SendKeys "%{F4}"
    Set aIE = Nothing
    Set Rank = Nothing
    Set Company = Nothing
    Set Revenues = Nothing
    MsgBox "All Done!"
End Sub

Upvotes: 1

Max08

Reputation: 1025

If the site is using AJAX to lazy-load the remaining links, you would need to make the page load them first. My suggestion would be to use Selenium to drive the page (scrolling until everything has loaded) and then use your code to get the links.

http://selenium-python.readthedocs.io/
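Since the question is in VBA, the same idea can also be done with SeleniumBasic (a Selenium wrapper for VBA). The following is only a rough sketch, not a tested solution: it assumes SeleniumBasic is installed with a reference to "Selenium Type Library", that Chrome is available, and that the page still uses the same class names as in the question; the scroll count and wait time are guesses to tune.

```vb
Sub Company_Links_Selenium()
    'Requires SeleniumBasic and a reference to "Selenium Type Library"
    Dim driver As New Selenium.WebDriver
    Dim anchor As Selenium.WebElement
    Dim i As Long, r As Long

    driver.Start "chrome", "http://fortune.com"
    driver.Get "/fortune500/list/"

    'Scroll to the bottom repeatedly so the lazy-loaded rows render;
    '30 passes with a 1-second pause is a guess
    For i = 1 To 30
        driver.ExecuteScript "window.scrollTo(0, document.body.scrollHeight);"
        driver.Wait 1000
    Next i

    'Same container class as in the question, written as a CSS selector
    For Each anchor In driver.FindElementsByCss(".small-12.column.row a")
        r = r + 1
        Sheets("Sheet1").Cells(r, 1) = anchor.Attribute("href")
    Next anchor

    driver.Quit
End Sub
```

Because Selenium runs a real browser, the `href` attributes come back as absolute URLs, so there is no need for the `Split(..., "about:")` workaround from the question.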

Upvotes: 0
