Reputation: 5157
I am learning Google's Go programming language. Does anyone know the best practice for extracting all URLs from an HTML web page?
Coming from the Java world, I know there are libraries that do the job, for example jsoup, htmlparser, etc. But for Go, I guess no similar library is available yet?
Upvotes: 27
Views: 35836
Reputation: 1
You can also use "Colly" (see its documentation); it is usually used for web scraping.
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	c := colly.NewCollector()

	// Find and visit all links (only anchors that carry an href)
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	// Log every request before it is sent
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("Visiting", r.URL)
	})

	c.Visit("http://go-colly.org/")
}
Upvotes: 0
Reputation: 5945
I just published an open-source, event-based, HTML 5.0 compliant parsing package for Go. You can find it here.
Here is the sample code to get all the links from a page (from A elements):
links := make([]string, 0)
parser := NewParser(htmlContent)
parser.Parse(nil, func(e *HtmlElement, isEmpty bool) {
	// Collect the href of every A element
	if e.TagName == "a" {
		link, _ := e.GetAttributeValue("href")
		if link != "" {
			links = append(links, link)
		}
	}
}, nil)
A few things to keep in mind:
Upvotes: 0
Reputation: 1330102
While the Go package for HTML parsing is indeed still in progress, it is available in the go.net repository.
Its sources, formerly at code.google.com/p/go.net/html, are now at github.com/golang/net, and it is being actively developed.
It is mentioned in this recent go-nuts discussion.
Note that with Go 1.4 (Dec 2014), as I mentioned in this answer, the package now lives under golang.org/x/net, as golang.org/x/net/html (see godoc).
Upvotes: 17
Reputation: 23789
If you know jQuery, you'll love GoQuery.
Honestly, it's the easiest, most powerful HTML utility I've found in Go, and it's based off of the html package in the go.net repository. (Okay, so it's higher-level than just a parser as it doesn't expose raw HTML tokens and the like, but if you want to actually get anything done with an HTML document, this package will help.)
Upvotes: 26
Reputation: 11469
I've searched around and found that there is a library called Gokogiri, which sounds like Nokogiri for Ruby. I think the project is active, too.
Upvotes: 6
Reputation: 28385
Go's standard package for HTML parsing is still a work in progress and is not part of the current release. A third-party package you might try, though, is go-html-transform. It is being actively maintained.
Upvotes: 21