MITHU
MITHU

Reputation: 154

Scraping data from multiple sites using vba

I've written a code to parse data from multiple sites using vba via google. It can scrape the tag element ["h3" and "a"] irrespective of whatever the search is. However, I would like to parse the name or phone number of any search if what I'm expecting is possible. Thanks in advance for the help.

Sub GoogleSearch()
Dim http As New MSXML2.XMLHTTP60, html As New HTMLDocument, hmm As New HTMLDocument
Dim topics As Object, post As Object, link As Object, posts As Object
Dim url As String, z As String
Dim i As Long, LRow As Long

LRow = Range("A" & Rows.Count).End(xlUp).Row

For i = 3 To LRow
url = "https://www.google.co.in/search?q=" & Cells(i, 1)

http.Open "GET", url, False
http.setRequestHeader "Content-Type", "text/xml"
http.send

html.body.innerHTML = http.responseText

Set topics = html.getElementById("rso")
Set post = topics.getElementsByTagName("H3")(0)
Set link = post.getElementsByTagName("a")(0)

Cells(i, 2) = link.innerText
Cells(i, 3) = link.href
z = link.href
    http.Open "GET", z, False
    http.send
    hmm.body.innerHTML = http.responseText
    Set posts = hmm.getElementsByClassName("phone")
    If Not posts(0) Is Nothing Then
        Cells(i, 4) = posts(0).innerText
    Else    
        Cells(i, 4).Value = "Phone Not Found"
    End If
Next i
End Sub

Upvotes: 1

Views: 922

Answers (1)

David Zemens
David Zemens

Reputation: 53623

I've written a code to parse data from multiple sites using google.

Not exactly. What you've written is very specific and inflexible code which requires there will be a "phone" class, else there will be an error. Once you do this http.Open "GET", z, False you're navigating to a new website, and there is no reason to believe that say Yellopages and McDonalds would share even remotely the same structure. At this point, what you're doing is simply not flexible enough to handle the infinite structures that (different) websites may take:

http.send
hmm.body.innerHTML = http.responseText
Set posts = hmm.getElementsByClassName("phone")
Cells(i, 4) = posts(0).innerText

When you Set posts = hmm... you are assuming that there must be an element with ClassName= "phone" on every website you scrape. if there is no such element, then an error will raise (object required).

In order to get any element, you need to know something about the structure of the site you are scraping, but there are several ways you can do it:

  1. you can use an indexed call to getElementsByTagName or getElementsByClassName (what you're currently doing), but this requires you know the tag name or class name
  2. Matching some condition while you iterate over the collection returned by getElementsByTagName with an Exit For if your condition is found
  3. If your browser supports it, you can use the getElementByID (if ID tag is available, it is a unique identifier)
  4. you can provide the xpath identifier for the element if known.
  5. You could try ingesting the entire result and using Regular Expressions to extract the phone numbers from the plain HTML source, and that may be the most reliable in this case.

If you provide sample source code of the webpage being parsed, it may be easier to give you specific assistance, however, since you are apparently trying to scrape various websites, it is unlikely that a single solution can be obtained.

What I might do, following-up on your comment is something like this:

    http.Open "GET", z, False
    http.send
    hmm.body.innerHTML = http.responseText

    Set posts = hmm.getElementsByClassName("phone")
    If posts Is Nothing Then'# This condition is True if no "phone" element exist
        cells(1,4).Value = "phone not found at " & z
    Else
        Cells(i, 4) = posts(0).innerText
    End If

Next i

This will at least run without error, and it will identify the URLs which do not contain the "phone" class. You can then examine the source HTML of those web pages and you can modify your code to handle additional cases.

You could try ingesting the entire result and using Regular Expressions to extract the phone numbers from the plain HTML source, and that may be the most reliable in this case.

Upvotes: 1

Related Questions