Hennes
Hennes

Reputation: 1370

Dynamic Programming with Vectors in Haskell

I'm trying to code a kind of a simple web crawler in haskell just for practice. To my own astonishment neither the web request itself nor parsing the web site was any complicated. I coded the program purely functional with a recursive function, but only some fourty or fifty web requests later, the program eats up all the memory.

So I tried to do the task with dynamic programming, but here I'm totally stuck, which means, I have no idea where to begin. In this tiny program I got so many errors, that I'm not able to figure out, where to start.

This is my current concept:

scanPage :: String -> IO (String,String,[String])
scanPage url = ....

crawler :: String -> IO [(String, Int)]
crawler startUrl = runST $ do
    toVisit <- newSTRef [startUrl] :: ST s (STRef s [String])
    visited <- newSTRef [] :: ST s (STRef s [String])
    result  <- newSTRef [] :: ST s (STRef s [(String, Int)])
    -- Iterate over urls to visit
    while (liftM not $ liftM null $ readSTRef toVisit) $ do
        url <- fmap (head) (readSTRef toVisit)            
        (moreUrls, value_a, value_b) <- scanPage url
        -- Mark page as visited
        vis <- readSTRef visited
        writeSTRef visited (url : vis)
        -- Add Results
        res <- readSTRef result
        writeSTRef result ((value_a, value_b) : res)
        -- Extend urls to visit
        nextUrls <- readSTRef toVisit
        writeSTRef toVisit (nextUrls ++ (moreUrls \\ vis))
    -- End of while
    return =<< readSTRef result

main = do
    putStrLn =<< fmap show (crawler "http://starturl.com")

I already wrote a lot of programs like this with arrays, which are much more convenient, as I can simply write or read from or to array elements. So I thought I could use mutable vectors for these lists, but they can't grow (at least in the same instance) or shrink. So I ended up with simple lists in STRef.

The first line I can't get to work is the line with the while command. I wrote my own while function like this

while :: (Monad m) => m Bool -> m a -> m ()
while cond action = do
    c <- cond
    when c $ do
        action
        while cond action

because I couldn't find any other while command. I googled many days for mutable vectors, but was not able to find a single tutorial or even example that I could use here. Please, can anyone tell me, how to write a syntactical correct crawler function? Yes, a pure functional solution would be nicer and more "haskellish", but I'm considering me still as a beginner and all this monad-stuff is still a bit strange for me. I'm willing to learn, but a hint or even an example would be really awesome.

EDIT: Here comes some pseudocode of my messy code.

toVisitList = startURL
visitedList = []
resultList = []
while (length toVisitList /= 0) {
    url = head toVisitList   -- Get the 1st element
    toVisitList -= url       -- Remove this url from list
    visitedList += url       -- Append url to visitedList
    (moreUrls, val_a, val_b) = scanPage url
    resultList += (val_a, val_b) -- append the result
    toVisitList += (moreUrls - visitedList)
}
return resultList

EDIT: I still haven't any clue, how to put this pseudocode into real code, especially the while-statement. Any hints appreciacted.

Upvotes: 2

Views: 228

Answers (1)

dfeuer
dfeuer

Reputation: 48581

The natural data structure for your toVisitList is a queue. There are a few implementations of queues around, but for this purpose, the simplest thing is to just use Data.Sequence.Seq. This lets you add things to the end with |> or <>, and to view the beginning with viewl. Consider something like

crawlOnce :: Seq Url -> [Url] -> IO (Either [Url] (Seq Url, [Url]))

crawlOnce toVisitList visitedList uses viewl to look at the front of the list of URLs to visit. If it's empty, it returns Left visitedList. Otherwise, it visits the first URL, appends it to the visited list, and adds the newly discovered URLS to the list to visit, then wraps them up in Right.

There are several reasonable variations. For instance, you could go for a type like ExceptT [Url] (StateT (Seq Url, [Url]) IO) a that "throws" its final result.

Upvotes: 1

Related Questions