Jifeng Zhang

Reputation: 5157

Extract links from a web page using Go lang

I am learning Google's Go programming language. What is the best practice for extracting all URLs from an HTML web page?

Coming from the Java world, there are libraries that do the job, for example jsoup, htmlparser, etc. But for Go, I guess no similar library is available yet?

Upvotes: 27

Views: 35836

Answers (6)

Stanimir Berdinskih

Reputation: 1

You can also use Colly (documentation), which is commonly used for web scraping.

Features

  1. Clean API
  2. Fast (>1k request/sec on a single core)
  3. Manages request delays and maximum concurrency per domain
  4. Automatic cookie and session handling
  5. Sync/async/parallel scraping
  6. Distributed scraping
  7. Caching
  8. Automatic encoding of non-Unicode responses
  9. Robots.txt support
  10. Google App Engine support
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Find and visit all links
    c.OnHTML("a", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://go-colly.org/")
}

Upvotes: 0

Marcelo Calbucci

Reputation: 5945

I just published an open source, event-based, HTML 5.0-compliant parsing package for Go. You can find it here.

Here is the sample code to get all the links from a page (from A elements):

var links []string

parser := NewParser(htmlContent)

parser.Parse(nil, func(e *HtmlElement, isEmpty bool) {
    if e.TagName == "a" {
        link, _ := e.GetAttributeValue("href")
        if link != "" {
            links = append(links, link)
        }
    }
}, nil)

A few things to keep in mind:

  • These may be relative links, not full URLs
  • Dynamically generated links will not be collected
  • There are other links not being collected (META tags, images, iframes, etc.). It's pretty easy to modify this code to collect those.

Upvotes: 0

VonC

Reputation: 1330102

While the Go package for HTML parsing is indeed still in progress, it is available in the go.net repository.

Its sources were originally at code.google.com/p/go.net/html and now live at github.com/golang/net, where the package is actively developed.

It is mentioned in this recent go-nuts discussion.


Note that with Go 1.4 (Dec 2014), as I mentioned in this answer, the package is now golang.org/x/net (see godoc).

Upvotes: 17

Matt

Reputation: 23789

If you know jQuery, you'll love GoQuery.

Honestly, it's the easiest, most powerful HTML utility I've found in Go, and it's based on the html package in the go.net repository. (Okay, so it's higher-level than just a parser, as it doesn't expose raw HTML tokens and the like, but if you want to actually get anything done with an HTML document, this package will help.)

Upvotes: 26

Ye Lin Aung

Reputation: 11469

I've searched around and found a library called Gokogiri, which is similar to Nokogiri for Ruby. The project seems to be active, too.

Upvotes: 6

Sonia

Reputation: 28385

Go's standard package for HTML parsing is still a work in progress and is not part of the current release. A third party package you might try though is go-html-transform. It is being actively maintained.

Upvotes: 21
