maurizio
maurizio

Reputation: 11

extract text from a web page code vba excel

I have this example code where a method is used to extract text from a web page. In the complete code, these pages are then used to extract a whole series of data very quickly. The complete code has stopped working because this part used to extract only the text from the web page has stopped working correctly, or rather it still works if I use Excel 2010 while it has stopped working correctly for a couple of years on Excel 365, probably after some update. Instead of returning only the text, it returns a mixture of JavaScript, CSS, and text. Another strange thing is that it works regularly for an acquaintance of mine who works with Office 365 Enterprise. Do you see any possibility of making it work on Excel 365 as well, or is it an Excel bug? The example code to show only the non-functioning part is as follows:

Sub EstraiTestoDaPaginaWeb()
    Dim xmlHttp As Object
    Dim html As Object
    Dim testoEstratto As String
    Dim elem As Object

    ' Creazione dell'oggetto XMLHTTP
    Set xmlHttp = CreateObject("MSXML2.XMLHTTP.6.0")

    ' URL della pagina web
    Dim url As String
    url = "https://www.borsaitaliana.it/borsa/obbligazioni/mot/btp/scheda/IT0001086567.html?lang=it"

    ' Invio della richiesta
    xmlHttp.Open "GET", url, False
    xmlHttp.send

    ' Creazione dell'oggetto HTMLDocument
    Set html = CreateObject("HTMLFILE")
    html.body.innerHTML = xmlHttp.responseText

    ' Estrazione del testo
    testoEstratto = html.body.innerText

    ' Visualizzare il testo estratto
    MsgBox testoEstratto
End Sub

Excel 365 Results:

Excel 365

Excel 2010 Results:

Excel 2010

I don't know English and I use translators to communicate, so I'm afraid I might not have been clear. I'd like to try to simplify my question: is there a way to achieve the result shown in the attached images using Excel 365? And why do you think the sample code posted above continues to work with Excel 2010 but is no longer valid for today's Excel 365? It's the code that I don't understand and why it doesn't work anymore. You can also replace the borsaitaliana URL with any website you want.

Upvotes: 1

Views: 98

Answers (1)

Haluk
Haluk

Reputation: 1586

The web page you are using is populated from a JSON feed which can be extracted by the below code.

Then, you can parse the JSon response by using Regular Expressions or by using the JSON Parser given by Tim Hall https://github.com/VBA-tools/VBA-JSON

Sub Test()
    Dim xmlHttp As Object, URL As String, PayLoad As String, strJson As String

    Set xmlHttp = CreateObject("MSXML2.XMLHTTP")

    URL = "https://charts.borsaitaliana.it/charts/services/ChartWService.asmx/GetPrices"
    
    PayLoad = "{""request"":{""SampleTime"":""1mm"",""TimeFrame"":""1d"",""RequestedDataSetType"":""ohlc""," _
            & """ChartPriceType"":""price"",""Key"":""IT0001086567.MOT"",""OffSet"":0,""FromDate"":null," _
            & """ToDate"":null,""UseDelay"":false,""KeyType"":""Topic"",""KeyType2"":""Topic"",""Language"":""it-IT""}}"
    
    xmlHttp.Open "POST", URL, False
    xmlHttp.setRequestHeader "Content-Type", "application/json"
    xmlHttp.send PayLoad

    strJson = xmlHttp.responseText

    MsgBox strJson
End Sub

EDIT:

This is the Regular Expressions approach;

Sub Test2()
    Dim xmlHttp As Object, URL As String, PayLoad As String
    Dim strJSON As String, regExp As Object, RetVal As Object, icount As Integer, i As Integer, r As Integer
    
    Range("A2:F" & Rows.Count) = ""
    
    Set xmlHttp = CreateObject("MSXML2.ServerXMLHTTP")

    URL = "https://charts.borsaitaliana.it/charts/services/ChartWService.asmx/GetPrices"
    
    PayLoad = "{""request"":{""SampleTime"":""1mm"",""TimeFrame"":""1d"",""RequestedDataSetType"":""ohlc""," _
            & """ChartPriceType"":""price"",""Key"":""IT0001086567.MOT"",""OffSet"":0,""FromDate"":null," _
            & """ToDate"":null,""UseDelay"":false,""KeyType"":""Topic"",""KeyType2"":""Topic"",""Language"":""it-IT""}}"
    
    xmlHttp.Open "POST", URL, False
    xmlHttp.setRequestHeader "Content-Type", "application/json"
    xmlHttp.send PayLoad

    strJSON = xmlHttp.responseText
    
    Set regExp = CreateObject("VBScript.RegExp")
    regExp.Pattern = "([\d+,.]+)"
    regExp.Global = True
    regExp.IgnoreCase = True
    
    Set RetVal = regExp.Execute(strJSON)
    
    icount = RetVal.Count
    r = 2
    
    For i = 0 To icount - 1 Step 2
        Range("A" & r).Resize(1, 6) = Split(regExp.Execute(strJSON)(i).SubMatches(0), ",")
        Range("A" & r) = Format((Range("A" & r) / 1000 / 60 / 60 / 24) + DateSerial(1970, 1, 1), "dd.mm.yyyy hh:mm:ss")
        r = r + 1
    Next
End Sub

Upvotes: 2

Related Questions