Molenpad

Reputation: 1044

Ignore external links in go web crawler

I'm really new to Go, and I'm playing with it at the moment by building a simple web crawler following this tutorial: https://jdanger.com/build-a-web-crawler-in-go.html

It's broken down really nicely, but I want to put something in place so that only links on the main domain get enqueued, not external ones.

So let's say I'm crawling https://www.mywebsite.com: I only want to include things like https://www.mywebsite.com/about-us or https://www.mywebsite.com/contact. I don't want subdomains, such as https://subdomain.mywebsite.com, or external links like https://www.facebook.com, as I don't want the crawler to fall into a black hole.

Looking at the code, I think I need to make the change to this function which fixes relative links:

// fixUrl takes a (possibly relative) link and the page it was found on,
// parses both with the url package, and uses ResolveReference to work
// out where the link really points. If href is already an absolute URL
// this is a no-op. We work with parsed URL objects inside the func but
// return a plain string.
func fixUrl(href, base string) string {
  uri, err := url.Parse(href)
  if err != nil {
    return ""
  }
  baseUrl, err := url.Parse(base)
  if err != nil {
    return ""
  }
  uri = baseUrl.ResolveReference(uri)
  return uri.String()
}
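
To check my understanding, I wrote this little snippet to see how ResolveReference behaves (the URLs are just made-up examples):

package main

import (
  "fmt"
  "net/url"
)

func main() {
  base, _ := url.Parse("https://www.mywebsite.com/about-us")

  rel, _ := url.Parse("/contact")                 // relative link found on the page
  abs, _ := url.Parse("https://www.facebook.com") // already absolute, external

  fmt.Println(base.ResolveReference(rel)) // https://www.mywebsite.com/contact
  fmt.Println(base.ResolveReference(abs)) // https://www.facebook.com (unchanged)
}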

However, I'm not 100% sure how to do that; I'm assuming some sort of if/else or further parsing is required.

Any tips would be hugely appreciated for my learning.

Upvotes: 2

Views: 402

Answers (1)

Benny Jobigan

Reputation: 5309

I quickly read the jdanger tutorial and ran the complete example. No doubt there are a few ways to accomplish what you want to do, but here's my take.

You basically want to avoid enqueueing any URL whose domain doesn't match some specified domain, presumably provided as a command-line arg. The example uses the fixUrl() function both to construct full absolute URLs and to signal invalid URLs (by returning ""). That function relies on the net/url package for parsing, and specifically on the URL data type, which is a struct with this definition:

type URL struct {
    Scheme      string
    Opaque      string    // encoded opaque data
    User        *Userinfo // username and password information
    Host        string    // host or host:port
    Path        string    // path (relative paths may omit leading slash)
    RawPath     string    // encoded path hint (see EscapedPath method); added in Go 1.5
    ForceQuery  bool      // append a query ('?') even if RawQuery is empty; added in Go 1.7
    RawQuery    string    // encoded query values, without '?'
    Fragment    string    // fragment for references, without '#'
    RawFragment string    // encoded fragment hint (see EscapedFragment method); added in Go 1.15
}

The one to take note of is Host. Host is the 'whatever.com' part of a URL, including any subdomain and the port, if present (see the Wikipedia article on URLs for more info). Reading further in the documentation, there is a Hostname() method which strips the port, if there is one.
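
For example, a quick sketch (the URL here is just an illustration):

package main

import (
    "fmt"
    "net/url"
)

func main() {
    u, err := url.Parse("https://subdomain.mywebsite.com:8080/about-us")
    if err != nil {
        panic(err)
    }
    fmt.Println(u.Host)       // subdomain.mywebsite.com:8080
    fmt.Println(u.Hostname()) // subdomain.mywebsite.com (port stripped)
}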

So, although you could add domain filtering to fixUrl(), a better design, in my opinion, would be to 'fix' the URL first, then do an additional check on the result to see whether its Host matches the desired domain. If it does not match, do not enqueue the URL and continue to the next item in the queue.
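
Here is a minimal sketch of just that check in isolation. The sameHost helper and the allowedHost variable are my own names, not part of the tutorial; in the real crawler allowedHost would come from the start URL (or a command-line arg), and you'd call the helper on each fixed URL right before enqueueing it:

package main

import (
    "fmt"
    "net/url"
)

// sameHost reports whether link's host exactly matches the host we
// started crawling, so subdomains and external sites are rejected.
func sameHost(link, allowedHost string) bool {
    u, err := url.Parse(link)
    if err != nil {
        return false
    }
    return u.Hostname() == allowedHost
}

func main() {
    allowedHost := "www.mywebsite.com" // assumption: derived from the start URL in the real program

    for _, link := range []string{
        "https://www.mywebsite.com/contact",
        "https://subdomain.mywebsite.com/",
        "https://www.facebook.com",
    } {
        fmt.Println(link, sameHost(link, allowedHost))
    }
}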

So, basically I think you are on the right track. I've kept the sketch above to just the host check rather than writing out the full feature, to encourage you to work the rest out yourself, though I did add your feature to my local copy of the tutorial's program.

Upvotes: 1
