Tushar
Tushar

Reputation: 97

Fetching data from web page

I am trying to fetch the publication date corresponding to each patent number.

Here is the Excel sheet:

enter image description here

The database is espacenet.com

Here's the link for the first patent you see in the Excel sheet:
http://worldwide.espacenet.com/searchResults?compact=false&PN=US7055777B2&ST=advanced&locale=en_EP&DB=EPODOC

Under the "Publication Info" header, I need to get the date after matching the patent number with the one in the Excel sheet.

Here's the code:

Sub tryextraction()

Dim ie As New InternetExplorer
Dim sdd As String
Dim tdd() As String
Dim num0 As Integer
Dim num1 As Integer
Dim doc As HTMLDocument
Dim i As Integer
Dim j As Integer

ie.Visible = True

num1 = ActiveSheet.UsedRange.Rows.Count

For num0 = 2 To num1
     ie.navigate "http://worldwide.espacenet.com/searchResults?compact=false&PN=" & Range("A" & num0) & "&ST=advanced&locale=en_EP&DB=EPODOC"

    Do
        DoEvents
    Loop Until ie.readyState = READYSTATE_COMPLETE

    Set doc = ie.document
    sdd = Trim(doc.getElementsByTagName("td")(5).innerText)
    tdd() = Split(sdd, vbLf)
    j = UBound(tdd)

    For i = 0 To j
        If InStr(tdd(i), "(") <> 0 Then
            tdd(i) = Replace(tdd(i), " ", "")
            tdd(i) = Replace(tdd(i), "(", "")
            tdd(i) = Replace(tdd(i), ")", "")

            If tdd(i) = Range("A" & num0).Value Then
                Range("B" & num0).Value = tdd(i + 1)
            End If
        End If
    Next i
Next num0
ie.Quit

End Sub

The code is not giving any error. The column "Publication Date" remains blank after the code finishes running.

The html tag which contains the publication info has been taken correctly.

Upvotes: 2

Views: 473

Answers (2)

QHarr
QHarr

Reputation: 84465

There are often multiple publication dates under the publication info header.

Example:

Example


The following script obtains all of these and the preceeding line (so you have the associated publication along with date).

It loops from row 2 of the Activesheet, to the last populated row, picking up the Publication Numbers from column A and writing out the results starting from column B. Depending on how many dates there are, the data will extend across multiple columns from B.


Regex:

A regex of ^(.*)\s\d{4}-\d{2}-\d{2} is used to retrieve the date pattern and the preceeding line i.e. The publication identifier and the date. Try it

Regex matches


Example output:

Example output


VBA:

Option Explicit

Public Sub GetInfo()
    Dim IE As New InternetExplorer, html As New HTMLDocument, url As String, pubInfo As Object
    Dim loopRange As Range, iRow As Range, counter As Long
    'example US7055777B2
    Application.ScreenUpdating = False
    With ActiveSheet
        Set loopRange = Range("A2:A" & .Cells(.Rows.Count, "A").End(xlUp).Row)
    End With

    With IE
        .Visible = True
        counter = 2                              '<== start from row 2
        For Each iRow In loopRange
            If Not IsEmpty(iRow) Then
                url = "https://worldwide.espacenet.com/searchResults?compact=false&PN=" & iRow.Value & "&ST=advanced&locale=en_EP&DB=EPODOC"
                .navigate url

                While .Busy Or .readyState < 4: DoEvents: Wend
                Set html = .document
                Do
                    DoEvents
                    On Error Resume Next
                    Set pubInfo = html.querySelector(".publicationInfoColumn")
                    On Error GoTo 0
                Loop While pubInfo Is Nothing

                Dim tempArr()
                tempArr = GetDateAndPatent(pubInfo.innerText, "^(.*)\s\d{4}-\d{2}-\d{2}") '"(?m)^(.*)\s\d{4}-\d{2}-\d{2}" '<==This is not supported

                With ActiveSheet
                    .Cells(counter, 2).Resize(1, UBound(tempArr) + 1) = tempArr
                End With
            End If
            counter = counter + 1
        Next iRow
        .Quit                                    '<== Remember to quit application
    End With
    Application.ScreenUpdating = True
End Sub

Public Function GetDateAndPatent(ByVal inputString As String, ByVal sPattern As String) As Variant
    Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long

    With CreateObject("vbscript.regexp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern
        If .test(inputString) Then
            Set matches = .Execute(inputString)
            For Each iMatch In matches
                ReDim Preserve arrMatches(i)
                arrMatches(i) = iMatch.Value
                i = i + 1
            Next iMatch
        End If
    End With
    GetDateAndPatent = arrMatches
End Function

Upvotes: 0

Denise Skidmore
Denise Skidmore

Reputation: 2416

There are some trailing white space characters after the ID you are searching for in the document so tdd(i) = Range("A" & num0).Value never evaluates to true. It's not just a space, so a simple Trim(tdd(i)) = Range("A" & num0).Value call does not help. Try instead InStr(tdd(i), Range("A" & num0).Value) If that is not good enough, you'll have to specifically remove CRLF from the end of the string before doing the compare.

Upvotes: 2

Related Questions