Nick
Nick

Reputation: 775

Pulling text from website into Excel by Using VBA

I am slowly exploring if I can use VBA to code a macro that will search a website from a list of keywords/codes in column A and extract the data. Currently The code below searches the desired website using the range in ("A1") only but does get to the right page with the data I wish to extract. In this case the Code in a1 is 100-52-7

Sub BrowseToSite()

Dim IE As New SHDocVw.InternetExplorer
Dim HTMLDoc As MSHTML.HTMLDocument

IE.Visible = True
IE.Navigate "https://apps.who.int/food-additives-contaminants-jecfa-database/Search.aspx"

Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop

IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$txtSearch").Value = Range("a1").Value
IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click

Set HTMLDoc = IE.Document
'Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText


End Sub

Now I wish to pull the "0-5 mg/kg bw (1996)" phrase on this page into Excel. I planned to do this by retriving the inner text within the class name however I run into an error Object Variable or With Block variable not set with the following line:

Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText

Upvotes: 2

Views: 1241

Answers (2)

SIM
SIM

Reputation: 22440

You can get rid of IE altogether and try using xmlhttp requests to make the script robust. What the following script does is send a get http requests first to scrape the value of certain parameters supposed to be used within post requests and then issue a post requests to parse the desired content.

This is one of the efficient ways how you can:

Option Explicit
Public Sub GetContent()
    Const Url = "https://apps.who.int/food-additives-contaminants-jecfa-database/Search.aspx"
    Dim oHttp As Object, oHtml As HTMLDocument, MyDict As Object
    Dim DictKey As Variant, payload$, searchKeyword$
    
    Set oHtml = New HTMLDocument
    Set oHttp = CreateObject("MSXML2.XMLHTTP")
    Set MyDict = CreateObject("Scripting.Dictionary")
    
    'send get requests first to parse the value of "__VIEWSTATE", "__VIEWSTATEGENERATOR" e.t.c., as in oHtml.getElementById("__VIEWSTATE").Value
    
    With oHttp
        .Open "GET", Url, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
        .send
        oHtml.body.innerHTML = .responseText
    End With
    
    searchKeyword = "100-52-7" 'this is the search keyword you wanna use from your predefined search terms
    
    'MyDict stores keys and values within dictionary, as in __VIEWSTATE = "some value" and so on
    
    MyDict("__VIEWSTATE") = oHtml.getElementById("__VIEWSTATE").Value
    MyDict("__VIEWSTATEGENERATOR") = oHtml.getElementById("__VIEWSTATEGENERATOR").Value
    MyDict("__EVENTVALIDATION") = oHtml.getElementById("__EVENTVALIDATION").Value
    MyDict("ctl00$ContentPlaceHolder1$txtSearch") = searchKeyword
    MyDict("ctl00$ContentPlaceHolder1$btnSearch") = "Search"
    MyDict("ctl00$ContentPlaceHolder1$txtSearchFEMA") = ""

    'joining each set of key and value with ampersand to make it a string so that you can use it as a parameter while issuing post requests, which is what payload is doing
    
    payload = ""
    For Each DictKey In MyDict
        payload = IIf(Len(DictKey) = 0, WorksheetFunction.encodeURL(DictKey) & "=" & WorksheetFunction.encodeURL(MyDict(DictKey)), _
        payload & "&" & WorksheetFunction.encodeURL(DictKey) & "=" & WorksheetFunction.encodeURL(MyDict(DictKey)))
    Next DictKey
    
    With oHttp
        .Open "POST", Url, False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36"
        .setRequestHeader "Content-type", "application/x-www-form-urlencoded"
        .send (payload)
        oHtml.body.innerHTML = .responseText
    End With
    
    MsgBox oHtml.querySelector("#SearchResultItem > a").NextSibling.NodeValue
    
End Sub

Make sure to add the following libraries to execute the above script:

Microsoft XML, v6.0
Microsoft Scripting Runtime
Microsoft HTML Object Library

Upvotes: 3

K.Dᴀᴠɪs
K.Dᴀᴠɪs

Reputation: 10139

You click on an element with this line of code:

IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click

for which IE makes a POST request to retrieve your results, as can be seen here:

enter image description here The above is a screen shot from Edge's dev tools, but concept is the same

During this request, the element in question is not immediately there, so you will need to wait for it to load.

Your prior method of

Do While IE.ReadyState <> READYSTATE_COMPLETE
Loop

would probably work, but I find it to be inconsistent at times and would also include checking the .Busy property as well.

Try using this after your click:

IE.Document.forms("form1").Elements("ctl00$ContentPlaceHolder1$btnSearch").Click

'~~WAIT FOR SEARCH RESULTS TO LOAD~~
Do While IE.ReadyState < READYSTATE_COMPLETE Or IE.Busy
Loop

Set HTMLDoc = IE.Document
Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText

If you're still having issues, you can force IE to wait for the element in question to become available by doing this:

On Error Resume Next
Do while HTMLDoc.getElementsByClassName("sectionHead1")(0) is Nothing
Loop
On Error Goto 0

Set HTMLDoc = IE.Document
Sheet1.Range("B1").Value = HTMLDoc.getElementsByClassName("sectionHead1")(0).innerText

This is a simple loop that checks for the object, and will continue to loop until that object is no longer Nothing (which essentially means it has loaded).

And I would recommend that you add some sort of timeout that may trigger an error or something just in case the webpage is having issues so you're not in an infinite loop.

Pro Tip:

If you are clicking the search button a lot of times and waiting for a lot of objects to load, instead of duplicating the above code you can turn it into it's own sub and do something like:

Sub WaitForElement(IE as InternetExplorer, elem As Object)
    
    Do While IE.ReadyState < 4 Or IE.Busy: Loop

    On Error Resume Next
    Do While elem is Nothing: Loop
    On error Goto 0

End Sub

Then you would just need to use the following line after each click:

WaitForElement IE, HTMLDoc.getElementsByClassName("sectionHead1")(0)

Not only would this cut down on the number of lines in your code, it could greatly improve readability as well.

Upvotes: 1

Related Questions