Reputation: 1531
I'm quite new to Go and I'm struggling a little at the moment with parsing some HTML.
The HTML looks like:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div>something</div>
<div id="publication">
<div>I want <span>this</span></div>
</div>
<div>
<div>not this</div>
</div>
</body>
</html>
And I want to get this as a string:
<div>I want <span>this</span></div>
I've tried html.NewTokenizer() (from golang.org/x/net/html) but can't seem to get the entire contents of an element back from a token or node. I've also tried using depth with this but it picked up other bits of code.
I've also had a go with goquery, which seems perfect. Code:
doc, err := goquery.NewDocument("{url}")
if err != nil {
	log.Fatal(err)
}
doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
	fmt.Printf("Review %d: %s\n", i, s.Html())
})
But s.Text() will only print out the text and s.Html() doesn't seem to exist (?).
I think parsing it as XML would work, except the actual HTML is very deep and there would have to be a struct for each parent element...
Any help would be amazing!
Upvotes: 2
Views: 4537
Reputation: 169
s.Html() does exist, but it returns two values, the inner HTML as a string and an error, so it can't be passed directly to fmt.Printf as a single argument. That is why your code doesn't compile.
Assign both return values first, then print the string:
doc.Find("#publication").Each(func(i int, s *goquery.Selection) {
	// Html() returns (string, error): the inner HTML of the first matched element.
	innerHTML, err := s.Html()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Review %d: %s\n", i, innerHTML)
})
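On the XML idea from the question: if the markup happens to be well-formed enough for encoding/xml, you may not need a struct per parent element. Below is a rough sketch using only the standard library. It walks the token stream until it finds the element with the wanted id, then captures that element's raw contents with the `innerxml` struct tag. The names innerXMLByID, rawInner and sampleHTML are my own, invented for illustration, and real-world HTML is often too messy for this approach, so treat it as a fallback rather than a replacement for goquery.

```go
package main

import (
	"encoding/xml"
	"errors"
	"fmt"
	"io"
	"log"
	"strings"
)

// sampleHTML mirrors the document from the question.
const sampleHTML = `<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<div>something</div>
<div id="publication">
<div>I want <span>this</span></div>
</div>
<div>
<div>not this</div>
</div>
</body>
</html>`

// rawInner captures the unparsed markup nested inside one element.
type rawInner struct {
	Content string `xml:",innerxml"`
}

// innerXMLByID is a hypothetical helper: it returns the raw contents of the
// first element whose id attribute equals id, with surrounding whitespace trimmed.
func innerXMLByID(doc, id string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	// Relax the parser a little for HTML input (assumes mostly well-formed markup).
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return "", errors.New("no element with id " + id)
		}
		if err != nil {
			return "", err
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue
		}
		for _, attr := range start.Attr {
			if attr.Name.Local == "id" && attr.Value == id {
				// Decode only this element; innerxml keeps its contents verbatim.
				var inner rawInner
				if err := dec.DecodeElement(&inner, &start); err != nil {
					return "", err
				}
				return strings.TrimSpace(inner.Content), nil
			}
		}
	}
}

func main() {
	inner, err := innerXMLByID(sampleHTML, "publication")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(inner)
}
```

Note that goquery's s.Html() returns the inner HTML as well, including the line breaks around the nested div, so you may want strings.TrimSpace on its result too.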
Upvotes: 2