Zakiirim

Reputation: 81

Web scraping: how to match a string with the InStr function

I'm scraping the ECB's website for annual reports as a practice exercise. After finding the href of every PDF on the page, I get lots of strings like this:

https://www.ecb.europa.eu/pub/pdf/annrep/ar2016en.pdf?cb49eb74de9ddf1f55ebe03fb610d05b
https://www.ecb.europa.eu/pub/pdf/annrep/ar2015en.pdf?2e7998c5daf6a2a7e4bfccb41e81b504
https://www.ecb.europa.eu/pub/pdf/annrep/ar2014en.pdf?20def41d1b09b84d5889c707f92c9e4a
https://www.ecb.europa.eu/pub/pdf/annrep/ar2013en.pdf?fad3a17bf210c3c411c6e3c3121eb8a1
https://www.ecb.europa.eu/pub/pdf/annrep/ar2012en.pdf?40f7b4588f9adb8cf61ce44014c1b088

And so on.

Now I would like to click the href that CONTAINS the string the user submits. (For example, if I enter 2015, it clicks the second href.)

I tried InStr, but it only works if I enter the full href.

My code is this:

Sub prova()

Dim Ie As New SHDocVw.InternetExplorer
Dim Iedoc As MSHTML.HTMLDocument
Dim element As Object
Dim elements As MSHTML.IHTMLElementCollection
Dim parameter As String

parameter = "2015" 'i will insert application.inputbox


With Ie:
    .navigate "https://www.ecb.europa.eu/pub/annual/html/index.en.html"
    .Visible = True
End With

While Ie.readyState <> READYSTATE_COMPLETE Or Ie.Busy: DoEvents: Wend

Set Iedoc = Ie.document

Set elements = Iedoc.getElementsByClassName("pdf")

For Each element In elements:
    If InStr(1, parameter, element) Then
        element.Click
    End If
    Debug.Print element
Next element

End Sub

Upvotes: 1

Views: 190

Answers (1)

QHarr

Reputation: 84465

Instr expects a string, not an object, as the param to search in.

Syntax

InStr([ start ], string1, string2, [ compare ])

The ordering is also:

string1 Required. String expression being searched.

string2 Required. String expression sought
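To see why the argument order matters, here is a minimal sketch using one of the hrefs from the question. The procedure name DemoInStrOrder and the variable names are made up for illustration; it needs no browser and can be run from the Immediate window.

```vba
Option Explicit

Sub DemoInStrOrder()
    Dim href As String
    Dim yearText As String

    ' Sample href copied from the question.
    href = "https://www.ecb.europa.eu/pub/pdf/annrep/ar2015en.pdf?2e7998c5daf6a2a7e4bfccb41e81b504"
    yearText = "2015"

    ' Correct order: search inside href (string1) for yearText (string2).
    Debug.Print InStr(1, href, yearText) > 0    ' True

    ' Reversed order: looks for the whole href inside "2015".
    ' The sought string is longer than the searched one, so there is no match.
    Debug.Print InStr(1, yearText, href)        ' 0
End Sub
```

With the arguments reversed, a match is only possible when the user types the complete href, which is the behaviour described in the question.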

Depending on which string you are searching for, and where it sits, you might choose InStrRev, which searches from the end of the source string, for a faster match. Note that the arguments are then:

InStrRev(stringcheck, stringmatch, [ start, [ compare ]])
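A short sketch of that call, again with a sample href from the question (DemoInStrRev is a hypothetical name). Note that the searched string comes first and the optional start argument moves to third position, unlike InStr.

```vba
Option Explicit

Sub DemoInStrRev()
    Dim href As String

    ' Sample href copied from the question.
    href = "https://www.ecb.europa.eu/pub/pdf/annrep/ar2015en.pdf?2e7998c5daf6a2a7e4bfccb41e81b504"

    ' Searches href from the end; returns the position of the match, or 0 if none.
    Debug.Print InStrRev(href, "2015") > 0                      ' True

    ' Same search with start (-1 = from the last character) and compare spelled out.
    Debug.Print InStrRev(href, "2015", -1, vbTextCompare) > 0   ' True
End Sub
```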

Technically, I think it is a parameter in the signature but an argument when a value is passed, though someone can correct me if I'm wrong.


You should search the element's href:

InStr(1, element.href, parameter) > 0

At a push you could search the outerHTML instead, but that is a larger search space and so less efficient.

It is more efficient still to let the DOM parser filter the results, using a CSS attribute = value selector with the contains (*), starts with (^), or ends with ($) operator:

contains operator:

Iedoc.querySelector("[href*='" & parameter & "']").click

It would be safer to test for a longer substring in the href attribute so something like:

param = 2015 
Iedoc.querySelector(".doc-title [href*='/pub/annual/html/ar" & param & "']").click

Then you can get rid of the entire loop.
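One thing to watch with the selector approach: querySelector returns Nothing when no element matches, so calling .click directly on the result raises a run-time error for a year that is not on the page. A minimal sketch of a guarded version follows. The procedure name ClickReportLink is hypothetical, and it assumes a reference to the Microsoft HTML Object Library, as in the question's code.

```vba
Option Explicit

' Requires a reference to "Microsoft HTML Object Library" (MSHTML).
Sub ClickReportLink(ByVal doc As MSHTML.HTMLDocument, ByVal yearText As String)
    Dim link As Object

    ' Attribute "contains" selector: first element whose href includes yearText.
    Set link = doc.querySelector("[href*='" & yearText & "']")

    If link Is Nothing Then
        ' No matching element: report it instead of failing on .Click.
        Debug.Print "No link found containing: " & yearText
    Else
        link.Click
    End If
End Sub
```

From the question's procedure this could be called as ClickReportLink Iedoc, parameter once the page has finished loading.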


Side-notes:

In your current loop you would also likely want an Exit For after match found.

Debug.Print element will, if a match is found, simply print [Object].

You would want to access a property of the element itself e.g. .innerText. However, given you just clicked on it, you risk a stale element exception bubbling up (or some other error) if element is now no longer attached to the DOM.
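If you prefer to keep the loop, the points above could be combined roughly as follows. This is only a sketch: ClickFirstMatch is a hypothetical name, and it assumes every element with class "pdf" is a link exposing an href property, as the question implies.

```vba
Option Explicit

' Requires a reference to "Microsoft HTML Object Library" (MSHTML).
Sub ClickFirstMatch(ByVal doc As MSHTML.HTMLDocument, ByVal parameter As String)
    Dim element As Object
    Dim matchedHref As String

    For Each element In doc.getElementsByClassName("pdf")
        ' Search inside the href (string1) for the user's text (string2).
        If InStr(1, element.href, parameter) > 0 Then
            ' Read what you need before clicking, while the element
            ' is certainly still attached to the DOM.
            matchedHref = element.href
            Debug.Print matchedHref
            element.Click
            Exit For    ' stop after the first match
        End If
    Next element
End Sub
```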

Upvotes: 2
