user2288856

Reputation: 49

How can I get the content of an html.Node?

I would like to get data from a URL using the third-party Go library at http://godoc.org/code.google.com/p/go.net/html . However, I ran into a problem: I can't get the text content of an html.Node.

The reference documentation includes this example:

s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>`
doc, err := html.Parse(strings.NewReader(s))
if err != nil {
    log.Fatal(err)
}
var f func(*html.Node)
f = func(n *html.Node) {
    if n.Type == html.ElementNode && n.Data == "a" {
        for _, a := range n.Attr {
            if a.Key == "href" {
                fmt.Println(a.Val)
                break
            }
        }
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        f(c)
    }
}
f(doc)

The output is:

foo
/bar/baz

What I want to get instead is the link text:

Foo
BarBaz

How can I do that?

Upvotes: 4

Views: 4509

Answers (1)

tux21b

Reputation: 94849

The tree of <a href="link"><strong>Foo</strong>Bar</a> looks basically like this:

  • ElementNode "a" (this node also includes a list of attributes)
    • ElementNode "strong"
      • TextNode "Foo"
    • TextNode "Bar"

So, assuming that you want the plain text of the link (e.g. FooBar), you have to walk through the subtree below the "a" element and collect all of its text nodes. For example:

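// collectText appends the Data of every text node in the subtree
// rooted at n to buf, in document order. It needs the "bytes" package.
func collectText(n *html.Node, buf *bytes.Buffer) {
    if n.Type == html.TextNode {
        buf.WriteString(n.Data)
    }
    // Recurse into the children so that nested elements are covered too.
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        collectText(c, buf)
    }
}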

And the changes in your function:

var f func(*html.Node)
f = func(n *html.Node) {
    if n.Type == html.ElementNode && n.Data == "a" {
        text := &bytes.Buffer{}
        collectText(n, text)
        fmt.Println(text)
    }
    for c := n.FirstChild; c != nil; c = c.NextSibling {
        f(c)
    }
}
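
To try the walk without pulling in the external package, here is a minimal sketch that runs with the standard library only. The Node and NodeType types, sampleTree and linkText are hypothetical stand-ins made up for this example; they only mirror the html.Node fields the walk touches (Type, Data, FirstChild, NextSibling). The tree is the one for <a href="link"><strong>Foo</strong>Bar</a> shown above, built by hand, and strings.Builder is used in place of bytes.Buffer.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeType and Node are simplified stand-ins (an assumption of this sketch).
// They mirror only the html.Node fields that the walk uses.
type NodeType int

const (
	TextNode NodeType = iota
	ElementNode
)

type Node struct {
	Type        NodeType
	Data        string
	FirstChild  *Node
	NextSibling *Node
}

// collectText appends the Data of every text node under n, in document order.
func collectText(n *Node, sb *strings.Builder) {
	if n.Type == TextNode {
		sb.WriteString(n.Data)
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		collectText(c, sb)
	}
}

// linkText is a hypothetical helper that returns the plain text of one element.
func linkText(n *Node) string {
	var sb strings.Builder
	collectText(n, &sb)
	return sb.String()
}

// sampleTree hand-builds the tree for <a href="link"><strong>Foo</strong>Bar</a>.
func sampleTree() *Node {
	bar := &Node{Type: TextNode, Data: "Bar"}
	foo := &Node{Type: TextNode, Data: "Foo"}
	strong := &Node{Type: ElementNode, Data: "strong", FirstChild: foo, NextSibling: bar}
	return &Node{Type: ElementNode, Data: "a", FirstChild: strong}
}

func main() {
	fmt.Println(linkText(sampleTree())) // prints "FooBar"
}
```

With the real package, the same functions should work after replacing *Node with *html.Node and TextNode with html.TextNode. Returning a string from a small helper like linkText also keeps the buffer out of the calling code.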

Upvotes: 11
