Jeroen Bourgois

Reputation: 180

Web scraping loop with Haskell

I want to learn Haskell and I have another small project (currently in Elixir) that I'd like to port as an exercise. It is a simple web scraper that scrapes a list of urls.

Imagine having a list of zip codes, around 2500 items. For each entry, a web page should be scraped, in the form of http://www.acme.org/zip-info?zip={ZIP}. I managed to write the code to crawl a single web page using Scalpel.

But how would I go about scraping the 2500 items? In Elixir I map over the list of postal codes, and after each page request there is a short sleep of 1 second to reduce the load on the target website. It is not important to me to scrape the website as fast as possible.

How would I do this in Haskell? I read about threadSleep, but how do I use it in combination with the list to traverse and the main function, since sleeping is a side effect?

Thanks for the insights!

Upvotes: 2

Views: 111

Answers (1)

Noughtmare

Reputation: 10645

Presumably you already have a function like:

scrapeZip :: Zip -> IO ZipResult

Then you can write a function with traverse to get an IO action that returns a list of zip results:

scrapeZips :: [Zip] -> IO [ZipResult]
scrapeZips zipCodes = traverse scrapeZip zipCodes
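To see how this fits together with main, here is a minimal sketch. The Zip and ZipResult types and the body of scrapeZip are placeholders I made up for illustration; your real scrapeZip would call Scalpel instead. In IO, traverse runs the actions one after another, in list order, and collects the results.

```haskell
-- Placeholder types, assumed for this sketch only.
type Zip = String
type ZipResult = Int

-- Stub scraper: prints the URL it would request and returns a dummy value.
-- Replace the body with your actual Scalpel code.
scrapeZip :: Zip -> IO ZipResult
scrapeZip z = do
  putStrLn ("would fetch http://www.acme.org/zip-info?zip=" ++ z)
  return (length z)

-- traverse turns the per-zip action into one action over the whole list.
scrapeZips :: [Zip] -> IO [ZipResult]
scrapeZips zipCodes = traverse scrapeZip zipCodes

main :: IO ()
main = do
  results <- scrapeZips ["1000", "9000"]
  print results
```

Because scrapeZips is itself an IO action, main can simply bind its result with <- like any other action.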

But you want to add a delay, which can be done using threadDelay (you can import it from Control.Concurrent):

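import Control.Concurrent (threadDelay)

scrapeZipDelay :: Zip -> IO ZipResult
scrapeZipDelay zip = do
  x <- scrapeZip zip       -- scrape the page first
  threadDelay 1000000      -- then pause: one second in microseconds
  return x                 -- and return the scraped result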

And then you can use this scrapeZipDelay with traverse:

scrapeZipsDelay :: [Zip] -> IO [ZipResult]
scrapeZipsDelay zipCodes = traverse scrapeZipDelay zipCodes

Instead of defining a separate scrapeZipDelay function, you can also write a compact version with the <* operator, which runs both actions in order and keeps the result of the first:

scrapeZipsDelay :: [Zip] -> IO [ZipResult]
scrapeZipsDelay zipCodes = 
  traverse (\zip -> scrapeZip zip <* threadDelay 1000000) zipCodes
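If it helps, the same idea can be wrapped up as a complete program. This is only a sketch: the Zip and ZipResult types, the stub scrapeZip, and the helper name traverseWithDelay are all assumptions of mine, not something from a library. The helper takes the delay as a parameter so you can tune it without touching the scraping code.

```haskell
import Control.Concurrent (threadDelay)

-- Placeholder types, assumed for this sketch only.
type Zip = String
type ZipResult = String

-- Stub scraper standing in for the real Scalpel-based function.
scrapeZip :: Zip -> IO ZipResult
scrapeZip z = return ("result for " ++ z)

-- Hypothetical helper: run the action on each element in order,
-- pausing for the given number of microseconds after each one.
traverseWithDelay :: Int -> (a -> IO b) -> [a] -> IO [b]
traverseWithDelay micros action =
  traverse (\x -> action x <* threadDelay micros)

main :: IO ()
main = do
  -- 1000000 microseconds = one second between requests
  results <- traverseWithDelay 1000000 scrapeZip ["1000", "2000", "9000"]
  mapM_ putStrLn results
```

Note that this version also sleeps once after the last element, which is harmless for a job like this.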

Upvotes: 4
