Reputation: 1370
I'm trying to code a kind of a simple web crawler in haskell just for practice. To my own astonishment neither the web request itself nor parsing the web site was any complicated. I coded the program purely functional with a recursive function, but only some fourty or fifty web requests later, the program eats up all the memory.
So I tried to do the task with dynamic programming, but here I'm totally stuck, which means, I have no idea where to begin. In this tiny program I got so many errors, that I'm not able to figure out, where to start.
This is my current concept:
scanPage :: String -> IO (String,String,[String])
scanPage url = ....
crawler :: String -> IO [(String, Int)]
crawler startUrl = runST $ do
toVisit <- newSTRef [startUrl] :: ST s (STRef s [String])
visited <- newSTRef [] :: ST s (STRef s [String])
result <- newSTRef [] :: ST s (STRef s [(String, Int)])
-- Iterate over urls to visit
while (liftM not $ liftM null $ readSTRef toVisit) $ do
url <- fmap (head) (readSTRef toVisit)
(moreUrls, value_a, value_b) <- scanPage url
-- Mark page as visited
vis <- readSTRef visited
writeSTRef visited (url : vis)
-- Add Results
res <- readSTRef result
writeSTRef result ((value_a, value_b) : res)
-- Extend urls to visit
nextUrls <- readSTRef toVisit
writeSTRef toVisit (nextUrls ++ (moreUrls \\ vis))
-- End of while
return =<< readSTRef result
main = do
putStrLn =<< fmap show (crawler "http://starturl.com")
I already wrote a lot of programs like this with arrays, which are much more convenient, as I can simply write or read from or to array elements. So I thought I could use mutable vectors for these lists, but they can't grow (at least in the same instance) or shrink. So I ended up with simple lists in STRef.
The first line I can't get to work is the line with the while command. I wrote my own while function like this
while :: (Monad m) => m Bool -> m a -> m ()
while cond action = do
c <- cond
when c $ do
action
while cond action
because I couldn't find any other while command. I googled many days for mutable vectors, but was not able to find a single tutorial or even example that I could use here. Please, can anyone tell me, how to write a syntactical correct crawler function? Yes, a pure functional solution would be nicer and more "haskellish", but I'm considering me still as a beginner and all this monad-stuff is still a bit strange for me. I'm willing to learn, but a hint or even an example would be really awesome.
EDIT: Here comes some pseudocode of my messy code.
toVisitList = startURL
visitedList = []
resultList = []
while (length toVisitList /= 0) {
url = head toVisitList -- Get the 1st element
toVisitList -= url -- Remove this url from list
visitedList += url -- Append url to visitedList
(moreUrls, val_a, val_b) = scanPage url
resultList += (val_a, val_b) -- append the result
toVisitList += (moreUrls - visitedList)
}
return resultList
EDIT: I still haven't any clue, how to put this pseudocode into real code, especially the while-statement. Any hints appreciacted.
Upvotes: 2
Views: 228
Reputation: 48581
The natural data structure for your toVisitList
is a queue. There are a few implementations of queues around, but for this purpose, the simplest thing is to just use Data.Sequence.Seq
. This lets you add things to the end with |>
or <>
, and to view the beginning with viewl
. Consider something like
crawlOnce :: Seq Url -> [Url] -> IO (Either [Url] (Seq Url, [Url]))
crawlOnce toVisitList visitedList
uses viewl
to look at the front of the list of URLs to visit. If it's empty, it returns Left visitedList
. Otherwise, it visits the first URL, appends it to the visited list, and adds the newly discovered URLS to the list to visit, then wraps them up in Right
.
There are several reasonable variations. For instance, you could go for a type like ExceptT [Url] (StateT (Seq Url, [Url]) IO) a
that "throws" its final result.
Upvotes: 1