WhiteleyJ
WhiteleyJ

Reputation: 1701

How to alter a record in a list, or data structure for rapidly adding and checking things off

I'm building a web crawler in F# and I'm running into a problem with how to store the pages I've already been to and the pages that I have yet to visit.

My current implementation involves tracking state with a list of records

type Page = {url:Uri; visited:bool; redirects:bool}
let createCrawlLink (url: Uri) = {url=url; visited=false; redirects=false}
let initialize url = [createCrawlLink(url)]
let uriInList(data:Page list)(uri:Uri) = List.exists (fun x -> x.url.AbsoluteUri = uri.AbsoluteUri) data
let add (data:Page list) (url) = 
    let uri = new Uri(url)
    match uriInList data uri with
    | true -> data
    | false -> (createCrawlLink uri) :: data 

Now when I pull the first item off that list and visit it I'd like to do a few things.

  1. Mark the url I was directed to as visited.
  2. If the url I ended up on wasn't the one I attempted to go to, mark that as visited and redirected.
  3. Add all new links to the list.
  4. Per page business logic
  5. Recurse until everything in the list is visited.

I am getting hung up on what the functional way of altering the records visited/redirected properties are. So far it seems like I have to find the record, make a copy with the properties I want changed, then copy the entire list into a new list with the old record removed and the new record added.

This seems like a lot of work but google isn't finding me any good data structures for this (or I don't know the words to search for). Is there a better cleaner way?

Upvotes: 2

Views: 208

Answers (2)

TheQuickBrownFox
TheQuickBrownFox

Reputation: 10624

I think storing pages to visit separately from pages visited makes this simpler and more efficient whether it's functional or not.

I would store visited pages in a Map<string, Page>, where string is the URL so that you have constant time access to visited pages.

Then take queued URLs to visit from the head of a list with pattern matching and build up the results in a map.

type Page = { url:Uri; redirects:bool }
type PagesVisited = Map<string, Page>

let rec crawl (urisToVisit:Uri list) (visited:PagesVisited) : PagesVisited =
    match urisToVisit with
    | uri :: remainingUris ->
        if Map.containsKey (uri:Uri).AbsoluteUri visited then
            crawl remainingUris visited
        else
            let (redirects, newUris) = visit uri
            let visited' = Map.add uri.AbsoluteUri {url=uri; redirects = redirects} visited
            crawl (newUris @ urisToVisit) visited'
    | [] ->
        printfn "Finished the internet"
        visited

// Kick it off
crawl [Uri("https://stackoverflow.com")] Map.empty

This shows you a possible functional way of doing this loop. I've left the implementation of visit for you.

Note that adding new items to the front of a list is efficient. It does not create a new copy of the list in memory. So I use the list concatenation operator @ to put what is likely to be the shorter list in front of what is likely to be longer.

Similarly, the PagesVisited map is not being copied on each loop, even though each instance is immutable. Structural sharing is used so that items can be added and removed while still holding references to previous versions of the map. This is much faster than a full copy.

If you care about making this fast and efficient more than keeping it functional, you would probably use the mutable collections ResizeArray and Dictionary instead.

Upvotes: 3

rmunn
rmunn

Reputation: 36708

You're using a list, but as ildjarn said in a comment, you should probably use a set instead. However, if you need to keep track of multiple flags per URI (has this one been visited? does this one redirect?), then you'd have to keep track of multiple sets (visitedURIs and redirectingURIs, for example).

Therefore, the data structure you probably want is the PersistentHashMap from FSharpx.Collections. It's a persistent data structure, so it's non-destructive every time you make an update in it, you get back a new hash map with the change, but the old hash map still exists unchanged so that any other functions that still have a reference to it will still see a consistent view of the data (this is a HUGE advantage when you start trying to parallelize your code!)

Also note that for lists, if you need to make frequent updates in the middle of an existing list, the PersistentVector type (also from FSharpx.Collections) is very well-suited to that.

Upvotes: 4

Related Questions