Reputation: 1516
I am having a problem scraping some HTML.
Here is the URL my macro is scraping, and below is an excerpt of the code:
    Set els = IE.Document.getElementsByTagName("a")
    For Each el In els
        If Trim(el.innerText) = "Documents" Then
            colDocLinks.Add el.href
        End If
    Next el
As you can see if you open the URL, we land on a page of search results. The macro then finds all the "Documents" links in the search table and puts them in a Collection named colDocLinks.
However, the results table contains 10-Q documents, which I want to include, but it also contains other filing types that I do not want, such as 10-Q/A documents.
How can I modify the loop so that it adds only plain 10-Q filings (with nothing appended to the form type) to the collection, and skips others such as 10-Q/A?
Upvotes: 0
Views: 247
Reputation: 91
    ' Requires references to "Microsoft Internet Controls" and
    ' "Microsoft HTML Object Library". WithEvents is only allowed in a
    ' class or object module (e.g. ThisWorkbook), not in a standard module.
    Public WithEvents objIE As InternetExplorer

    Sub LaunchIE()
        Set objIE = New InternetExplorer
        objIE.Visible = True
        objIE.Navigate "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=icld&type=10-Q%20&dateb=&owner=exclude&count=20"
    End Sub

    Private Sub objIE_DocumentComplete(ByVal pDisp As Object, URL As Variant)
        Dim localIE As InternetExplorer
        Set localIE = pDisp

        Dim doc As MSHTML.IHTMLDocument3
        Set doc = localIE.Document

        Dim tdElements As MSHTML.IHTMLElementCollection
        Dim td As MSHTML.IHTMLElement
        Set tdElements = doc.getElementsByTagName("td")

        For Each td In tdElements
            ' Exact comparison: a cell reading "10-Q/A" does not match
            If td.innerText = "10-Q" Then
                Dim tr As MSHTML.IHTMLElement
                Set tr = td.parentElement

                Dim childrenElements As MSHTML.IHTMLElementCollection
                Dim child As MSHTML.IHTMLElement
                Set childrenElements = tr.Children

                ' Look through the other cells of the same row
                For Each child In childrenElements
                    If child.innerText = " Documents" Then
                        'Handle found element
                    End If
                Next
            End If
        Next
    End Sub
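The same row-based check can also be applied directly to the loop from the question. The following is only a sketch: the procedure name AddPlain10QLinks is made up for illustration, and it assumes that each "Documents" link sits in a table cell whose row has the form type (10-Q, 10-Q/A, ...) in its first cell.

```vba
' Hypothetical helper: adds the "Documents" link of every row whose
' first cell reads exactly "10-Q" to colDocLinks.
' Assumes IE is a loaded InternetExplorer instance (late bound here).
Sub AddPlain10QLinks(ByVal IE As Object, ByVal colDocLinks As Collection)
    Dim el As Object
    Dim row As Object

    For Each el In IE.Document.getElementsByTagName("a")
        If Trim(el.innerText) = "Documents" Then
            ' <a> -> enclosing <td> -> enclosing <tr>
            Set row = el.parentElement.parentElement
            If UCase(row.tagName) = "TR" Then
                ' Exact match, so "10-Q/A" rows are skipped
                If Trim(row.Cells(0).innerText) = "10-Q" Then
                    colDocLinks.Add el.href
                End If
            End If
        End If
    Next el
End Sub
```

If the form type turns out to be in a different column, the index passed to Cells would need to change accordingly.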
Upvotes: 1
Reputation: 901
I would use a regular expression to find and extract the exact links I was looking for. Something like this:
    Dim RegEx As RegExp
    Set RegEx = New RegExp
    Dim match As match
    With RegEx
        .IgnoreCase = True
        .Global = True
        .MultiLine = True
    End With
    ' Quotes inside a VBA string literal must be doubled
    RegEx.Pattern = "<td nowrap=""nowrap"">10-Q</td>.+?<a href=""(.+?)\.htm"">"
    For Each match In RegEx.Execute(Selection)
        colDocLinks.Add match
    Next
I didn't test the regular expression above, so it may need some adjustment. You would need to include a reference to Microsoft VBScript Regular Expressions 5.5 for this to work.
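As a possible starting point for that adjustment, here is a sketch of the same idea wrapped in a function. It is also untested against the live page. The function name ExtractPlain10QLinks is made up for illustration, and html is assumed to hold the page source (for example IE.Document.body.innerHTML). It differs from the pattern above in three ways: it tolerates any attributes on the td tag, it uses [\s\S] because "." does not match line breaks in VBScript regular expressions, and it stores the captured href text rather than the Match object.

```vba
' Requires a reference to "Microsoft VBScript Regular Expressions 5.5".
' Hypothetical helper: returns the href that follows each table cell
' whose whole content is "10-Q".
Function ExtractPlain10QLinks(ByVal html As String) As Collection
    Dim re As RegExp
    Dim m As Match
    Dim links As New Collection

    Set re = New RegExp
    With re
        .IgnoreCase = True
        .Global = True
        ' The cell must contain only "10-Q", so "10-Q/A" cells do not match.
        ' The lazy [\s\S]+? then skips ahead to the next <a ... href="...">.
        .Pattern = "<td[^>]*>\s*10-Q\s*</td>[\s\S]+?<a[^>]*href=""([^""]+)"""
    End With

    For Each m In re.Execute(html)
        links.Add m.SubMatches(0)   ' captured href value
    Next m

    Set ExtractPlain10QLinks = links
End Function
```

Each item in the returned collection could then be added to colDocLinks. Note that hrefs taken from the raw HTML may be relative, so they might need the site prefix added before use.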
Upvotes: 0