How can I get the content of an html.Node?

Question

I would like to extract data from a URL using the third-party Go library at http://godoc.org/code.google.com/p/go.net/html , but I ran into a problem: I can't get the content of an html.Node.

The reference documentation includes this example:

s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
doc, err := html.Parse(strings.NewReader(s))
if err != nil {
    log.Fatal(err)
}
var f func(*html.Node)
f = func(n *html.Node) {
    if n.Type == html.ElementNode && n.Data == "a" {
        for _, a := range n.Attr {
            if a.Key == "href" {
                fmt.Println(a.Val)
                break
            }
        }
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        f(c)
    }
}
f(doc)

The output is:

foo
/bar/baz

If I want to get the link text instead:

Foo
BarBaz

what should I do?

tux21b · Accepted Answer

The tree of <a href="..."><strong>Foo</strong>Bar</a> looks basically like this:

ElementNode "a" (this node also includes a list of attributes)
- ElementNode "strong"
  - TextNode "Foo"
- TextNode "Bar"

So, assuming that you want to get the plain text of the link (e.g. FooBar), you have to walk through the tree and collect all text nodes. For example:

func collectText(n *html.Node, buf *bytes.Buffer) {
    if n.Type == html.TextNode {
        buf.WriteString(n.Data)
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        collectText(c, buf)
    }
}

And the changes in your function:

var f func(*html.Node)
f = func(n *html.Node) {
    if n.Type == html.ElementNode && n.Data == "a" {
        text := &bytes.Buffer{}
        collectText(n, text)
        fmt.Println(text)
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        f(c)
    }
}
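
To try the traversal without fetching the external package, here is a minimal sketch that builds the tree shown above by hand. Note the assumptions: `Node`, `NodeType`, `linkTexts` and `exampleTree` are hypothetical stand-ins written for this illustration, mirroring only the fields of html.Node used in this thread (Type, Data, FirstChild, NextSibling). It also uses a strings.Builder where the code above uses a bytes.Buffer, which is just a stylistic swap.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeType and Node are hypothetical stand-ins, NOT the real library types.
// They copy only the html.Node fields that the traversal needs.
type NodeType int

const (
	TextNode NodeType = iota + 1
	ElementNode
)

type Node struct {
	Type                    NodeType
	Data                    string
	FirstChild, NextSibling *Node
}

// collectText appends the Data of every text node under n, in document order.
func collectText(n *Node, sb *strings.Builder) {
	if n.Type == TextNode {
		sb.WriteString(n.Data)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectText(c, sb)
	}
}

// linkTexts walks the tree and returns the plain text of each "a" element.
func linkTexts(root *Node) []string {
	var out []string
	var walk func(*Node)
	walk = func(n *Node) {
		if n.Type == ElementNode && n.Data == "a" {
			var sb strings.Builder
			collectText(n, &sb)
			out = append(out, sb.String())
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(root)
	return out
}

// exampleTree builds the tree from the answer by hand:
// "a" has two children, a "strong" element (containing "Foo") and the text "Bar".
func exampleTree() *Node {
	bar := &Node{Type: TextNode, Data: "Bar"}
	foo := &Node{Type: TextNode, Data: "Foo"}
	strong := &Node{Type: ElementNode, Data: "strong", FirstChild: foo, NextSibling: bar}
	return &Node{Type: ElementNode, Data: "a", FirstChild: strong}
}

func main() {
	for _, t := range linkTexts(exampleTree()) {
		fmt.Println(t) // prints "FooBar"
	}
}
```

With the real package, the same functions should work after replacing the stand-ins with html.Node, html.TextNode and html.ElementNode, and passing in the document returned by html.Parse.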
