Jifeng Zhang

Reputation: 5157

Extract links from a web page using Go lang

I am learning Google's Go programming language. What is the best practice for extracting all URLs from an HTML web page?

Coming from the Java world, there are libraries that do the job, for example jsoup, htmlparser, etc. But for Go, I guess no similar library is available yet?

Upvotes: 27

Views: 35836

Answers (6)

Stanimir Berdinskih

Reputation: 1

You can also use Colly (documentation), which is commonly used for web scraping.

Features

  1. Clean API
  2. Fast (>1k request/sec on a single core)
  3. Manages request delays and maximum concurrency per domain
  4. Automatic cookie and session handling
  5. Sync/async/parallel scraping
  6. Distributed scraping
  7. Caching
  8. Automatic encoding of non-Unicode responses
  9. Robots.txt support
  10. Google App Engine support
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // Find and visit all links
    c.OnHTML("a", func(e *colly.HTMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("http://go-colly.org/")
}

Upvotes: 0

Marcelo Calbucci

Reputation: 5945

I just published an open source, event-based, HTML 5.0-compliant parsing package for Go. You can find it here.

Here is the sample code to get all the links from a page (from A elements):

var links []string

parser := NewParser(htmlContent)

parser.Parse(nil, func(e *HtmlElement, isEmpty bool) {
    if e.TagName == "a" {
        link, _ := e.GetAttributeValue("href")
        if link != "" {
            links = append(links, link)
        }
    }
}, nil)

A few things to keep in mind:

  • These may be relative links, not full URLs
  • Dynamically generated links will not be collected
  • There are other links not being collected (META tags, images, iframes, etc.). It's pretty easy to modify this code to collect those.

Upvotes: 0

VonC

Reputation: 1330102

While the Go package for HTML parsing is indeed still in progress, it is available in the go.net repository.

Its sources were originally at code.google.com/p/go.net/html and now live at github.com/golang/net, where the package is actively developed.

It is mentioned in this recent go-nuts discussion.


Note that with Go 1.4 (Dec 2014), as I mentioned in this answer, the package is now golang.org/x/net (see godoc).

Upvotes: 17

Matt

Reputation: 23789

If you know jQuery, you'll love GoQuery.

Honestly, it's the easiest, most powerful HTML utility I've found in Go, and it's based on the html package in the go.net repository. (Okay, so it's higher-level than just a parser, as it doesn't expose raw HTML tokens and the like, but if you want to actually get anything done with an HTML document, this package will help.)

Upvotes: 26

Ye Lin Aung

Reputation: 11469

I've searched around and found a library called Gokogiri, which is similar to Nokogiri for Ruby. The project seems to be active, too.

Upvotes: 6

Sonia

Reputation: 28385

Go's standard package for HTML parsing is still a work in progress and is not part of the current release. A third party package you might try though is go-html-transform. It is being actively maintained.

Upvotes: 21
