Reputation: 1701
I'm building a web crawler in F# and I'm running into a problem with how to store the pages I've already been to and the pages that I have yet to visit.
My current implementation involves tracking state with a list of records
type Page = {url:Uri; visited:bool; redirects:bool}
let createCrawlLink (url: Uri) = {url=url; visited=false; redirects=false}
let initialize url = [createCrawlLink(url)]
let uriInList(data:Page list)(uri:Uri) = List.exists (fun x -> x.url.AbsoluteUri = uri.AbsoluteUri) data
let add (data:Page list) (url) =
let uri = new Uri(url)
match uriInList data uri with
| true -> data
| false -> (createCrawlLink uri) :: data
Now when I pull the first item off that list and visit it I'd like to do a few things.
I am getting hung up on what the functional way of altering the records visited/redirected properties are. So far it seems like I have to find the record, make a copy with the properties I want changed, then copy the entire list into a new list with the old record removed and the new record added.
This seems like a lot of work but google isn't finding me any good data structures for this (or I don't know the words to search for). Is there a better cleaner way?
Upvotes: 2
Views: 208
Reputation: 10624
I think storing pages to visit separately from pages visited makes this simpler and more efficient whether it's functional or not.
I would store visited pages in a Map<string, Page>
, where string
is the URL so that you have constant time access to visited pages.
Then take queued URLs to visit from the head of a list with pattern matching and build up the results in a map.
type Page = { url:Uri; redirects:bool }
type PagesVisited = Map<string, Page>
let rec crawl (urisToVisit:Uri list) (visited:PagesVisited) : PagesVisited =
match urisToVisit with
| uri :: remainingUris ->
if Map.containsKey (uri:Uri).AbsoluteUri visited then
crawl remainingUris visited
else
let (redirects, newUris) = visit uri
let visited' = Map.add uri.AbsoluteUri {url=uri; redirects = redirects} visited
crawl (newUris @ urisToVisit) visited'
| [] ->
printfn "Finished the internet"
visited
// Kick it off
crawl [Uri("https://stackoverflow.com")] Map.empty
This shows you a possible functional way of doing this loop. I've left the implementation of visit
for you.
Note that adding new items to the front of a list is efficient. It does not create a new copy of the list in memory. So I use the list concatenation operator @
to put what is likely to be the shorter list in front of what is likely to be longer.
Similarly, the PagesVisited
map is not being copied on each loop, even though each instance is immutable. Structural sharing is used so that items can be added and removed while still holding references to previous versions of the map. This is much faster than a full copy.
If you care about making this fast and efficient more than keeping it functional, you would probably use the mutable collections ResizeArray
and Dictionary
instead.
Upvotes: 3
Reputation: 36708
You're using a list, but as ildjarn said in a comment, you should probably use a set instead. However, if you need to keep track of multiple flags per URI (has this one been visited? does this one redirect?), then you'd have to keep track of multiple sets (visitedURIs
and redirectingURIs
, for example).
Therefore, the data structure you probably want is the PersistentHashMap
from FSharpx.Collections. It's a persistent data structure, so it's non-destructive every time you make an update in it, you get back a new hash map with the change, but the old hash map still exists unchanged so that any other functions that still have a reference to it will still see a consistent view of the data (this is a HUGE advantage when you start trying to parallelize your code!)
Also note that for lists, if you need to make frequent updates in the middle of an existing list, the PersistentVector
type (also from FSharpx.Collections) is very well-suited to that.
Upvotes: 4