Reputation: 154
I've written a code to parse data from multiple sites using vba via google. It can scrape the tag element ["h3" and "a"] irrespective of whatever the search is. However, I would like to parse the name or phone number of any search if what I'm expecting is possible. Thanks in advance for the help.
Sub GoogleSearch()
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument, hmm As New HTMLDocument
Dim topics As Object, post As Object, link As Object, posts As Object
Dim url As String, z As String
Dim i As Long, LRow As Long
LRow = Range("A" & Rows.Count).End(xlUp).Row
For i = 3 To LRow
url = "https://www.google.co.in/search?q=" & Cells(i, 1)
http.Open "GET", url, False
http.setRequestHeader "Content-Type", "text/xml"
http.send
html.body.innerHTML = http.responseText
Set topics = html.getElementById("rso")
Set post = topics.getElementsByTagName("H3")(0)
Set link = post.getElementsByTagName("a")(0)
Cells(i, 2) = link.innerText
Cells(i, 3) = link.href
z = link.href
http.Open "GET", z, False
http.send
hmm.body.innerHTML = http.responseText
Set posts = hmm.getElementsByClassName("phone")
If Not posts(0) Is Nothing Then
Cells(i, 4) = posts(0).innerText
Else
Cells(i, 4).Value = "Phone Not Found"
End If
Next i
End Sub
Upvotes: 1
Views: 922
Reputation: 53623
I've written a code to parse data from multiple sites using google.
Not exactly. What you've written is very specific and inflexible code which requires there will be a "phone" class, else there will be an error. Once you do this http.Open "GET", z, False
you're navigating to a new website, and there is no reason to believe that say Yellopages and McDonalds would share even remotely the same structure. At this point, what you're doing is simply not flexible enough to handle the infinite structures that (different) websites may take:
http.send
hmm.body.innerHTML = http.responseText
Set posts = hmm.getElementsByClassName("phone")
Cells(i, 4) = posts(0).innerText
When you Set posts = hmm...
you are assuming that there must be an element with ClassName= "phone" on every website you scrape. if there is no such element, then an error will raise (object required).
In order to get any element, you need to know something about the structure of the site you are scraping, but there are several ways you can do it:
getElementsByTagName
or getElementsByClassName
(what you're currently doing), but this requires you know the tag name or class namegetElementsByTagName
with an Exit For
if your condition is foundgetElementByID
(if ID tag is available, it is a unique identifier)xpath
identifier for the element if known.If you provide sample source code of the webpage being parsed, it may be easier to give you specific assistance, however, since you are apparently trying to scrape various websites, it is unlikely that a single solution can be obtained.
What I might do, following-up on your comment is something like this:
http.Open "GET", z, False
http.send
hmm.body.innerHTML = http.responseText
Set posts = hmm.getElementsByClassName("phone")
If posts Is Nothing Then'# This condition is True if no "phone" element exist
cells(1,4).Value = "phone not found at " & z
Else
Cells(i, 4) = posts(0).innerText
End If
Next i
This will at least run without error, and it will identify the URLs which do not contain the "phone" class. You can then examine the source HTML of those web pages and you can modify your code to handle additional cases.
You could try ingesting the entire result and using Regular Expressions to extract the phone numbers from the plain HTML source, and that may be the most reliable in this case.
Upvotes: 1