Molenpad

Reputation: 1044

Ignore external links in go web crawler

I'm really new to Go, and I'm playing with it at the moment by building a simple web crawler following this tutorial: https://jdanger.com/build-a-web-crawler-in-go.html

It's broken down really nicely, but I want to put something in place so that only links on the main domain get enqueued, not external ones.

So let's say I'm crawling https://www.mywebsite.com: I only want to include things like https://www.mywebsite.com/about-us or https://www.mywebsite.com/contact. I don't want subdomains, such as https://subdomain.mywebsite.com, or external links like https://www.facebook.com, as I don't want the crawler to fall into a black hole.

Looking at the code, I think I need to make the change to this function which fixes relative links:

// fixUrl takes a (possibly relative) link and the page it was found on,
// parses both with the url package, and uses ResolveReference to work
// out where the link really points. If href is already an absolute URL
// this is a no-op. We work with parsed URL objects inside the func but
// return a plain string.
func fixUrl(href, base string) string {
  uri, err := url.Parse(href)
  if err != nil {
    return ""
  }
  baseUrl, err := url.Parse(base)
  if err != nil {
    return ""
  }
  uri = baseUrl.ResolveReference(uri)
  return uri.String()
}
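
To check my understanding, I wrote this little snippet to see how ResolveReference behaves (the URLs are just made-up examples):

package main

import (
  "fmt"
  "net/url"
)

func main() {
  base, _ := url.Parse("https://www.mywebsite.com/about-us")

  rel, _ := url.Parse("/contact")                 // relative link found on the page
  abs, _ := url.Parse("https://www.facebook.com") // already absolute, external

  fmt.Println(base.ResolveReference(rel)) // https://www.mywebsite.com/contact
  fmt.Println(base.ResolveReference(abs)) // https://www.facebook.com (unchanged)
}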

However, I'm not 100% sure how to do that; I'm assuming some sort of if/else or further parsing is required.

Any tips would be hugely appreciated for my learning.

Upvotes: 2

Views: 402

Answers (1)

Benny Jobigan

Reputation: 5309

I quickly read the jdanger tutorial and ran the complete example. No doubt there are a few ways to accomplish what you want to do, but here's my take.

You basically want to avoid enqueueing any URL whose domain doesn't match some specified domain, presumably provided as a command-line arg. The example uses the fixUrl() function both to construct full absolute URLs and to signal invalid URLs (by returning ""). That function relies on the net/url package for parsing, and specifically on the URL data type, which is a struct with this definition:

type URL struct {
    Scheme      string
    Opaque      string    // encoded opaque data
    User        *Userinfo // username and password information
    Host        string    // host or host:port
    Path        string    // path (relative paths may omit leading slash)
    RawPath     string    // encoded path hint (see EscapedPath method); added in Go 1.5
    ForceQuery  bool      // append a query ('?') even if RawQuery is empty; added in Go 1.7
    RawQuery    string    // encoded query values, without '?'
    Fragment    string    // fragment for references, without '#'
    RawFragment string    // encoded fragment hint (see EscapedFragment method); added in Go 1.15
}

The one to take note of is Host. Host is the 'whatever.com' part of a URL, including any subdomain and the port, if present (see the Wikipedia article on URLs for more info). Reading further in the documentation, there is a Hostname() method which strips the port, if there is one.
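
For example, a quick sketch (the URL here is just an illustration):

package main

import (
    "fmt"
    "net/url"
)

func main() {
    u, err := url.Parse("https://subdomain.mywebsite.com:8080/about-us")
    if err != nil {
        panic(err)
    }
    fmt.Println(u.Host)       // subdomain.mywebsite.com:8080
    fmt.Println(u.Hostname()) // subdomain.mywebsite.com (port stripped)
}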

So, although you could add domain filtering to fixUrl(), a better design, in my opinion, would be to 'fix' the URL first, then do an additional check on the result to see whether its Host matches the desired domain. If it does not match, do not enqueue the URL and continue to the next item in the queue.
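
Here is a minimal sketch of just that check in isolation. The sameHost helper and the allowedHost variable are my own names, not part of the tutorial; in the real crawler allowedHost would come from the start URL (or a command-line arg), and you'd call the helper on each fixed URL right before enqueueing it:

package main

import (
    "fmt"
    "net/url"
)

// sameHost reports whether link's host exactly matches the host we
// started crawling, so subdomains and external sites are rejected.
func sameHost(link, allowedHost string) bool {
    u, err := url.Parse(link)
    if err != nil {
        return false
    }
    return u.Hostname() == allowedHost
}

func main() {
    allowedHost := "www.mywebsite.com" // assumption: derived from the start URL in the real program

    for _, link := range []string{
        "https://www.mywebsite.com/contact",
        "https://subdomain.mywebsite.com/",
        "https://www.facebook.com",
    } {
        fmt.Println(link, sameHost(link, allowedHost))
    }
}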

So, basically I think you are on the right track. I've kept the sketch above to just the host check rather than writing out the full feature, to encourage you to work the rest out yourself, though I did add your feature to my local copy of the tutorial's program.

Upvotes: 1
