Joe P.

Reputation: 547

Go Parse HTML table

I have an HTML table that I would like to parse, something like the one at http://sprunge.us/IJUC. However, I'm not sure of a good way to extract the information. I've seen a couple of HTML parsers, but they seem to require that each piece of data carry a distinguishing tag or attribute to select it by, whereas most of my data sits in plain <td></td> cells.

Does anyone have a suggestion for parsing this information out?

Upvotes: 8

Views: 9668

Answers (3)

Brad Rydzewski

Reputation: 2563

You may also be interested in Go's experimental HTML parser: https://code.google.com/p/go.net/html

The package definition according to the godoc:

Package html implements an HTML5-compliant tokenizer and parser

I haven't used it myself, but it seems pretty straightforward:

Parsing is done by calling Parse with an io.Reader, which returns the root of the parse tree (the document element) as a *Node. It is the caller's responsibility to ensure that the Reader provides UTF-8 encoded HTML.

go get code.google.com/p/go.net/html

import "code.google.com/p/go.net/html"

doc, err := html.Parse(r)

It is not part of any current release, but it can be used if you install from source or use the golang-tip Ubuntu apt repo.

EDIT: You can also use this mirror of the experimental Go packages: https://github.com/kless/go-exp

go get github.com/kless/go-exp/html

import (
    "github.com/kless/go-exp/html"
)

Upvotes: 2

mna

Reputation: 24003

Shameless plug: My goquery library. It's the jQuery syntax brought to Go (requires Go's experimental html package, see instructions in the README of the library).

So you can do things like this (assuming your HTML document is loaded in doc, a *goquery.Document):

doc.Find("td").Each(func(i int, s *goquery.Selection) {
	fmt.Printf("Content of cell %d: %s\n", i, s.Text())
})

Edit: Changed doc.Root.Find to doc.Find in the example, since a goquery Document is now a Selection too (new in the v0.2/master branch).

Upvotes: 16

sorcix

Reputation: 64

If your HTML is well-formed, you can use the built-in XML parser:

http://golang.org/pkg/encoding/xml/

Upvotes: -1
