How to get the contents of an HTML element

Question

I'm quite new to Go and I'm struggling a little at the moment with parsing some HTML.

The HTML looks like:




    



    something

    
        I want this
    

    
        not this

And I want to get this as a string:

I want this

I've tried html.NewTokenizer() (from golang.org/x/net/html) but can't seem to get the entire contents of an element back from a token or node. I've also tried using depth with this but it picked up other bits of code.

I've also had a go with goquery, which seems perfect. Code:

doc, err := goquery.NewDocument("{url}")
if err != nil {
    log.Fatal(err)
}

doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    fmt.Printf("Review %d: %s\n", i, s.Html())
})

But s.Text() will only print out the text and s.Html() doesn't seem to exist (?).

I think parsing it as XML would work, except the actual HTML is very deep and there would have to be a struct for each parent element...

Any help would be amazing!

Mikhail Andrianov · Accepted Answer

s.Html() does exist. You're not getting a result because it returns two values, the inner HTML as a string and an error, so it can't be passed straight to fmt.Printf. You need to assign both return values first.

Change your code like this and it will work:

doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
    // Html returns the inner HTML of the first matched element, plus an error.
    innerHTML, err := s.Html()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Review %d: %s\n", i, innerHTML)
})
