Vega

Reputation: 2929

How can I scrape values from embedded JavaScript in HTML?

I need to parse some values out of embedded JavaScript in a webpage. I tried to tokenize the HTML with something like this, but it doesn't tokenize the JavaScript part.

func CheckSitegroup(httpBody io.Reader) []string {
    sitegroups := make([]string, 0)
    page := html.NewTokenizer(httpBody)
    for {
        tokenType := page.Next()
        fmt.Println("TokenType:", tokenType)
        // check if HTML file has ended
        if tokenType == html.ErrorToken {
            return sitegroups
        }
        token := page.Token()
        fmt.Println("Token:", token)
        if tokenType == html.StartTagToken && token.DataAtom.String() == "script" {
            for _, attr := range token.Attr {
                fmt.Println("ATTR.KEY:", attr.Key)
                sitegroups = append(sitegroups, attr.Val)
            }
        }
    }
}

The script in the HTML body looks like this, and I need the campaign number (nil / "" if there is no number, or if there is no test.campaign = at all; the same goes for the sitegroup). Is there an easy way to get this information? I thought about regular expressions, but maybe there is something else? I have never worked with regex.

<script type="text/javascript" >
    var test = {};
    test.campaign = "8d26113ba";
    test.isTest = "false";
    test.sitegroup = "Homepage";
</script>

Upvotes: 3

Views: 1341

Answers (2)

siongui

Reputation: 101

Go's standard strings package provides many useful functions that you can use to parse the JavaScript code and get the campaign number you need.

The following code extracts the campaign number from the JS code provided in your question:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

const js = `
<script type="text/javascript" >
    var test = {};
    test.campaign = "8d26113ba";
    test.isTest = "false";
    test.sitegroup = "Homepage";
</script>
`

func StringToLines(s string) []string {
    var lines []string

    scanner := bufio.NewScanner(strings.NewReader(s))
    for scanner.Scan() {
        lines = append(lines, scanner.Text())
    }

    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "reading input:", err)
    }

    return lines
}

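// getCampaignNumber returns the quoted value assigned on the given line,
// or "" if the line contains no "=".
func getCampaignNumber(line string) string {
    parts := strings.SplitN(line, "=", 2)
    if len(parts) < 2 {
        return ""
    }
    // Drop surrounding whitespace, the trailing semicolon, and the quotes.
    tmp := strings.TrimSuffix(strings.TrimSpace(parts[1]), ";")
    return strings.Trim(tmp, `"'`)
}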

func main() {
    lines := StringToLines(js)
    for _, line := range lines {
        // Only the line that assigns test.campaign is of interest.
        if strings.Contains(line, "campaign") {
            result := getCampaignNumber(line)
            fmt.Println(result)
        }
    }
}

Upvotes: 0

Adam Vincze

Reputation: 871

First, you need to extract the JS code safely. The easiest way is with the goquery library: https://github.com/PuerkitoBio/goquery

After that, you need to extract the variables safely. Depending on how complicated it gets, you could parse the real JS abstract syntax tree and look for the right variables, for example with the parser from otto, the excellent JS interpreter in Go: http://godoc.org/github.com/robertkrimen/otto/parser

Or, as you mentioned, for the case above a regex would be really easy. There is a really nice tutorial on regexes in Go: https://github.com/StefanSchroeder/Golang-Regex-Tutorial
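
For the regex route, a minimal sketch using only the standard regexp package might look like the following. It assumes the assignments always have the form test.<name> = "<value>" as in the question's snippet; the helper name extractTestVars and the acceptance of single-quoted values are illustrative assumptions, not part of any library. It runs directly on the page source, so it does not depend on first isolating the script element.

```go
package main

import (
	"fmt"
	"regexp"
)

// assignRe matches assignments such as `test.campaign = "8d26113ba";` and
// captures the property name and the quoted value. The `test.` prefix comes
// from the question's snippet; accepting single quotes is an assumption.
var assignRe = regexp.MustCompile(`test\.(\w+)\s*=\s*["']([^"']*)["']`)

// extractTestVars is a hypothetical helper: it returns every test.<name>
// assignment found in body as a name -> value map.
func extractTestVars(body string) map[string]string {
	vars := make(map[string]string)
	for _, m := range assignRe.FindAllStringSubmatch(body, -1) {
		vars[m[1]] = m[2]
	}
	return vars
}

func main() {
	const page = `<script type="text/javascript" >
    var test = {};
    test.campaign = "8d26113ba";
    test.isTest = "false";
    test.sitegroup = "Homepage";
</script>`

	vars := extractTestVars(page)
	// Looking up a missing key yields "", which covers the
	// "empty if there is no assignment" requirement.
	fmt.Println("campaign:", vars["campaign"])
	fmt.Println("sitegroup:", vars["sitegroup"])
}
```

Returning a map keeps the lookup for campaign and sitegroup symmetric, and a missing key simply reads as the empty string. If the script ever builds these values dynamically, a regex will no longer be enough and the AST approach is the safer choice.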

Upvotes: 2
