Reputation: 1044
I'm really new to Go, and at the moment I'm playing with it by building a simple web crawler following this tutorial: https://jdanger.com/build-a-web-crawler-in-go.html
It's broken down really nicely, but I want to put something in place so that only links belonging to the main domain are enqueued, not external ones.
So let's say I'm crawling https://www.mywebsite.com. I only want to include things like https://www.mywebsite.com/about-us or https://www.mywebsite.com/contact. I don't want subdomains, such as https://subdomain.mywebsite.com, or external links such as https://www.facebook.com, as I do not want the crawler to fall into a black hole.
Looking at the code, I think I need to make the change in this function, which fixes relative links:
func fixUrl(href, base string) string { // given a relative link and the page on
	uri, err := url.Parse(href)         // which it's found, we can parse them
	if err != nil {                     // both and use the url package's
		return ""                       // ResolveReference function to figure
	}                                   // out where the link really points.
	baseUrl, err := url.Parse(base)     // If it's not a relative link this
	if err != nil {                     // is a no-op.
		return ""
	}
	uri = baseUrl.ResolveReference(uri)
	return uri.String() // We work with parsed URL objects in this
}                       // func, but we return a plain string.
However, I'm not 100% sure how to do that; I'm assuming some sort of if/else or further parsing is required.
Any tips would be hugely appreciated for my learning.
Upvotes: 2
Views: 402
Reputation: 5309
I quickly read the jdanger tutorial and ran the complete example. No doubt there are a few ways to accomplish what you want to do, but here's my take.
You basically want to avoid enqueueing any URL whose domain doesn't match some specified domain, presumably provided as a command-line arg. The example uses the fixUrl() function to construct full absolute URLs and also to signal invalid URLs (by returning ""). In that function it relies on the net/url package for parsing and such, and specifically on the URL data type. URL is a struct with this definition:
type URL struct {
	Scheme      string
	Opaque      string    // encoded opaque data
	User        *Userinfo // username and password information
	Host        string    // host or host:port
	Path        string    // path (relative paths may omit leading slash)
	RawPath     string    // encoded path hint (see EscapedPath method); added in Go 1.5
	ForceQuery  bool      // append a query ('?') even if RawQuery is empty; added in Go 1.7
	RawQuery    string    // encoded query values, without '?'
	Fragment    string    // fragment for references, without '#'
	RawFragment string    // encoded fragment hint (see EscapedFragment method); added in Go 1.15
}
The one to take note of is Host. Host is the 'whatever.com' part of a URL, including subdomains and the port (see this wikipedia article for more info). Reading further in the documentation, there is a method Hostname() which will strip the port, if present.
So, although you could add domain filtering to fixUrl(), a better design, in my opinion, would be to 'fix' the URL first, then do an additional check on the result to see whether its Host matches the desired domain. If it does not match, do not enqueue the URL and continue to the next item in the queue.
So, basically, I think you are on the right track. I'll leave wiring the check into the crawler as an exercise, though I did add your feature to my local copy of the tutorial's program.
Upvotes: 1