Reputation: 551
I am implementing a web crawler, and I have a Parse function that takes a link as input and should return all links contained in the page. I would like to make the most of goroutines to make it as fast as possible. To do so, I want to create a pool of workers.
I set up a channel of strings representing the links, links := make(chan string), and pass it as an argument to the Parse function. I want the workers to communicate through this single channel: when the function starts, it takes a link from links, parses it, and for each valid link found in the page, adds that link to links.
func Parse(links chan string) {
    l := <-links
    // If link already parsed, return.
    // ... fetch and parse l, collecting newUrlFounds ...
    for _, url := range newUrlFounds {
        links <- url
    }
}
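For context, here is a rough sketch of how I pictured wiring the pool together (the worker count of 10 and the seed URL are placeholders, not values from my actual code):

func main() {
    links := make(chan string)

    // Start a fixed pool of workers that all share the same channel.
    for i := 0; i < 10; i++ {
        go func() {
            for {
                Parse(links)
            }
        }()
    }

    // Seed the crawl with the first page.
    links <- "https://example.com"

    // How do I wait here until all the workers are done?
}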
However, the main issue here is how to indicate when no more links are left to be found. One way I thought of doing it was to wait until all workers have completed, but I don't know how to do that in Go.
Upvotes: 2
Views: 1069
Reputation: 31691
As Tim already commented, don't use the same channel for reading and writing in a worker. Sooner or later all workers end up blocked sending newly found links with none left to receive them, so this will deadlock eventually (even if buffered, because Murphy).
A far simpler design is simply launching one goroutine per URL. A buffered channel can serve as a simple semaphore to limit the number of concurrent parsers (goroutines that don't do anything because they are blocked are usually negligible). Use a sync.WaitGroup to wait until all work is done.
package main

import (
    "sync"
)

func main() {
    sem := make(chan struct{}, 10) // allow ten concurrent parsers
    wg := &sync.WaitGroup{}

    wg.Add(1)
    Parse("http://example.com", sem, wg)

    wg.Wait()
    // all done
}

func Parse(u string, sem chan struct{}, wg *sync.WaitGroup) {
    defer wg.Done()

    sem <- struct{}{}        // grab
    defer func() { <-sem }() // release

    // If URL already parsed, return.

    var newURLs []string
    // ... fetch u and fill newURLs ...

    for _, u := range newURLs {
        wg.Add(1)
        go Parse(u, sem, wg)
    }
}
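The "already parsed" check must be safe for concurrent use, because many goroutines may test and record URLs at the same time. A minimal sketch using a mutex-protected map (the visited map and the markVisited helper are illustrative names, not part of the code above; it reuses the sync import):

var (
    mu      sync.Mutex
    visited = make(map[string]bool)
)

// markVisited reports whether u has been seen before and records it if not.
// It is safe to call from multiple goroutines.
func markVisited(u string) bool {
    mu.Lock()
    defer mu.Unlock()
    if visited[u] {
        return true
    }
    visited[u] = true
    return false
}

Inside Parse, the placeholder comment then becomes something like if markVisited(u) { return }.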
Upvotes: 3