user2737876

Reputation: 1168

Golang parse HTML, extract all content with <body> </body> tags

I need to return all of the content within the body tags of an HTML document, including any nested HTML tags. What is the best way to go about this? I had a working solution with the Gokogiri package, but I am trying to stay away from packages that depend on C libraries. Is there a way to accomplish this with the Go standard library, or with a package that is 100% Go?

Since posting my original question I have tried the following packages, without success. Neither of them seems to return children or nested tags from inside the body. For example:

<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>

will return only "body content", ignoring the subsequent <p> tags and the text they wrap.

The overall goal is to obtain a string that looks like:

<body>
    body content 
    <p>more content</p>
</body>

Upvotes: 44

Views: 97705

Answers (5)

Pejman Khaleghi

Reputation: 30

I have written an HTML DOM parser in pure Go, with no dependencies. The project is still under development, but it is already usable, and I will keep developing it.

Project link: https://github.com/pejman-hkh/gdp/

Go Playground: https://go.dev/play/p/ksyPWPcDq2J

package main

import (
    "fmt"

    "github.com/pejman-hkh/gdp/gdp"
)

func main() {
    document := gdp.Default(`<!DOCTYPE html>
    <html>
        <head>
            <title>
                Title of the document
            </title>
        </head>
        <body>
            body content 
            <p>more content</p>
        </body>
    </html>`)

    body := document.Find("body").Eq(0)
    fmt.Print(body.OuterHtml())
}

Upvotes: -1

andybalholm

Reputation: 16170

Since you didn't show the source code of your attempt with the html package, I'll have to guess what you were doing, but I suspect you were using the tokenizer rather than the parser. Here is a program that uses the parser and does what you were looking for:

package main

import (
    "log"
    "os"
    "strings"

    "github.com/andybalholm/cascadia"
    "golang.org/x/net/html"
)

func main() {
    r := strings.NewReader(`<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>`)
    doc, err := html.Parse(r)
    if err != nil {
        log.Fatal(err)
    }

    body := cascadia.MustCompile("body").MatchFirst(doc)
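    // Render writes the <body> element and all of its descendants.
    if err := html.Render(os.Stdout, body); err != nil {
        log.Fatal(err)
    }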
}

Upvotes: 10

Joachim Birche

Reputation: 781

This can be solved by recursively searching for the body node with the golang.org/x/net/html package and then rendering the HTML starting from that node.

package main

import (
    "bytes"
    "errors"
    "fmt"
    "golang.org/x/net/html"
    "io"
    "strings"
)

func Body(doc *html.Node) (*html.Node, error) {
    var body *html.Node
    var crawler func(*html.Node)
    crawler = func(node *html.Node) {
        if node.Type == html.ElementNode && node.Data == "body" {
            body = node
            return
        }
        for child := node.FirstChild; child != nil; child = child.NextSibling {
            crawler(child)
        }
    }
    crawler(doc)
    if body != nil {
        return body, nil
    }
    return nil, errors.New("missing <body> in the node tree")
}

func renderNode(n *html.Node) string {
    var buf bytes.Buffer
    w := io.Writer(&buf)
    html.Render(w, n)
    return buf.String()
}

func main() {
    doc, err := html.Parse(strings.NewReader(htm))
    if err != nil {
        fmt.Println("parse error:", err)
        return
    }
    bn, err := Body(doc)
    if err != nil {
        fmt.Println(err)
        return
    }
    body := renderNode(bn)
    fmt.Println(body)
}

const htm = `<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
    body content
    <p>more content</p>
</body>
</html>`

Upvotes: 70

fredrik

Reputation: 13550

It can be done using the standard encoding/xml package, but it's a bit cumbersome. One caveat in this example is that the output will not include the enclosing body tag, though it will contain all of its children.

package main

import (
    "bytes"
    "encoding/xml"
    "fmt"
)

type html struct {
    Body body `xml:"body"`
}
type body struct {
    Content string `xml:",innerxml"`
}

func main() {
    b := []byte(`<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content 
        <p>more content</p>
    </body>
</html>`)

    h := html{}
    err := xml.NewDecoder(bytes.NewBuffer(b)).Decode(&h)
    if err != nil {
        fmt.Println("error", err)
        return
    }

    fmt.Println(h.Body.Content)
}

Runnable example:
http://play.golang.org/p/ZH5iKyjRQp
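If the enclosing tag is needed as well, one possible workaround is to walk the tokens with the same encoding/xml decoder and slice the original bytes between the offsets of the opening and closing body tags. This is only a minimal sketch: bodyOuterXML is a hypothetical helper name, and it assumes the markup is well-formed enough for encoding/xml in its default strict mode (unclosed tags or HTML-only entities such as &nbsp; would make it return an error).

```go
package main

import (
	"bytes"
	"encoding/xml"
	"errors"
	"fmt"
	"io"
)

// bodyOuterXML is a hypothetical helper: it returns the raw bytes from the
// opening <body> tag through its matching </body>, tags included.
func bodyOuterXML(src []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(src))
	start := int64(-1) // offset of the opening <body> tag, -1 until found
	depth := 0         // nesting depth of <body> elements
	for {
		before := dec.InputOffset() // offset where the next token begins
		tok, err := dec.Token()
		if err == io.EOF {
			return "", errors.New("no <body> element found")
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "body" {
				if depth == 0 {
					start = before
				}
				depth++
			}
		case xml.EndElement:
			if t.Name.Local == "body" {
				depth--
				if depth == 0 {
					// InputOffset now points just past "</body>".
					return string(src[start:dec.InputOffset()]), nil
				}
			}
		}
	}
}

func main() {
	doc := []byte(`<!DOCTYPE html>
<html>
    <head><title>Title of the document</title></head>
    <body>
        body content
        <p>more content</p>
    </body>
</html>`)

	out, err := bodyOuterXML(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Because the result is a slice of the input, whitespace and attribute formatting are preserved exactly as written in the source document.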

Upvotes: 11

Caleb

Reputation: 9478

You could also do this purely with strings:

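package main

import (
    "fmt"
    "io/ioutil"
    "strings"
)

// NewSkipTillReader and NewReadTillReader are not part of the standard
// library; their definitions are in the playground link below.
func main() {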
    r := strings.NewReader(`
<!DOCTYPE html>
<html>
    <head>
        <title>
            Title of the document
        </title>
    </head>
    <body>
        body content
        <p>more content</p>
    </body>
</html>
`)
    str := NewSkipTillReader(r, []byte("<body>"))
    rtr := NewReadTillReader(str, []byte("</body>"))
    bs, err := ioutil.ReadAll(rtr)
    fmt.Println(string(bs), err)
}

The definitions for SkipTillReader and ReadTillReader are here: https://play.golang.org/p/6THLhRgLOa. Basically, the first skips input until it sees its delimiter, and the second then reads until it sees its delimiter.

This matching is case-sensitive, so it won't find tags written as <BODY> or <Body> (though that wouldn't be hard to change).
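As a rough illustration of a case-insensitive variant, here is a sketch that swaps the two readers for a single regular expression from the standard regexp package. This is not the code from the playground link: extractBody and bodyRE are hypothetical names, and the pattern assumes there is no ">" inside the body tag's attribute values and no literal "</body>" inside scripts or comments.

```go
package main

import (
	"fmt"
	"regexp"
)

// bodyRE matches from an opening <body ...> tag to the first closing
// </body> tag. (?i) ignores letter case; (?s) lets "." match newlines.
var bodyRE = regexp.MustCompile(`(?is)<body(?:\s[^>]*)?>.*?</body\s*>`)

// extractBody is a hypothetical helper: it returns the matched body
// element (tags included) and whether a match was found.
func extractBody(doc string) (string, bool) {
	m := bodyRE.FindString(doc)
	return m, m != ""
}

func main() {
	doc := `<!DOCTYPE html>
<HTML>
    <BODY class="main">
        body content
        <p>more content</p>
    </Body>
</HTML>`

	if body, ok := extractBody(doc); ok {
		fmt.Println(body)
	} else {
		fmt.Println("no body element found")
	}
}
```

Unlike the reader-based version, this keeps the whole document in memory, which is usually fine for a single page but loses the streaming behaviour.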

Upvotes: 2
