Reputation: 2481
I wrote a generic enumerator for scraping sites as an exercise. It is complete and works fine, but I have a question about the design. The code is here if you want to look at it: https://github.com/mindreader/scrape-enumerator
The basic idea is that I wanted an enumerator that yields site-defined entries from paginated sources such as search engines and blogs: you fetch a page, it contains 25 entries, and you want them one entry at a time. At the same time I didn't want to write the plumbing for every site, so I wanted a generic interface. What I came up with is this (it uses type families):
    class SiteEnum a where
      type Result a :: *
      urlSource   :: a -> InputUrls (Int,Int)
      enumResults :: a -> L.ByteString -> Maybe [Result a]

    data InputUrls state
      = UrlSet [URL]
      | UrlFunc state (state -> (state,URL))
      | UrlPageDependent URL (L.ByteString -> Maybe URL)
To work for every type of site, this needs a URL source of some sort. That can be a list (possibly infinite) of pregenerated URLs, or an initial state plus a function that generates URLs from it (for example, when the URLs contain &page=1, &page=2, etc.). For really awkward pages like Google, you give an initial URL and a function that searches the page body for the next link. Your site makes a data type an instance of SiteEnum and gives Result a site-dependent type; the enumerator then deals with all the I/O, and you don't have to think about it. This works perfectly and I have implemented one site with it.
My question is about an annoyance with the InputUrls datatype. When I use UrlFunc, everything is golden. When I use UrlSet or UrlPageDependent, things are less pleasant: those constructors say nothing about the state type, so it is left undetermined, and I have to annotate the value with :: InputUrls () to get it to compile. This seems totally unnecessary, because given the way the program is written that type variable will never be used for the majority of sites, but I don't know how to get around it. I find that I want to use types like this in a lot of different contexts, and I always end up with stray type variables that are needed only by some constructors of the datatype. It doesn't feel like I should be using types this way. Is there a better way of doing this?
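For illustration, an instance for a made-up site might look like the sketch below. ExampleBlog, its URL scheme, and the one-entry-per-line body format are all hypothetical, and URL is assumed to be a plain String since its real definition is not shown here. Data.Kind's Type is used as a spelling of the kind `*`.

```haskell
{-# LANGUAGE TypeFamilies #-}

import Data.Kind (Type)
import qualified Data.ByteString.Lazy.Char8 as L

-- Assumption: URL is a plain String; the real definition is not shown.
type URL = String

data InputUrls state
  = UrlSet [URL]
  | UrlFunc state (state -> (state, URL))
  | UrlPageDependent URL (L.ByteString -> Maybe URL)

class SiteEnum a where
  type Result a :: Type
  urlSource   :: a -> InputUrls (Int, Int)
  enumResults :: a -> L.ByteString -> Maybe [Result a]

-- Hypothetical site used only for this example.
data ExampleBlog = ExampleBlog

instance SiteEnum ExampleBlog where
  -- Each entry is just a line of text in this sketch.
  type Result ExampleBlog = String

  -- State is (page number, entries per page); each step yields the URL
  -- for the current page and advances to the next page.
  urlSource _ = UrlFunc (1, 25) nextPage
    where
      nextPage (page, perPage) =
        ( (page + 1, perPage)
        , "http://example.com/blog?page=" ++ show page
            ++ "&count=" ++ show perPage
        )

  -- Treat every non-empty line of the body as one entry;
  -- a body with no entries yields Nothing.
  enumResults _ body =
    case filter (not . L.null) (L.lines body) of
      []      -> Nothing
      entries -> Just (map L.unpack entries)
```

A real site would parse HTML in enumResults instead of splitting lines; the point is only that the instance supplies a URL source and a parser, and nothing else.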
Upvotes: 4
Views: 176
Reputation: 139830
Why do you need the UrlFunc case at all? From what I understand, the only thing you're doing with the state function is using it to build a list like the one in UrlSet anyway, so instead of storing the state function, just store the resulting list. That way, you can eliminate the state type variable from your data type, which should eliminate the ambiguity problems.
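A minimal sketch of that suggestion, assuming URL is a plain String (its real definition is not shown in the question). The helper urlsFrom and the example.com URL scheme are hypothetical names made up for this illustration. Because Haskell lists are lazy, the generated list can be infinite and is only consumed as far as the enumerator needs.

```haskell
import qualified Data.ByteString.Lazy as L

-- Assumption: URL is a plain String; the question does not show its definition.
type URL = String

-- No state type variable any more: UrlFunc is gone.
data InputUrls
  = UrlSet [URL]
  | UrlPageDependent URL (L.ByteString -> Maybe URL)

-- Hypothetical helper: unfold an initial state and a step function
-- into a lazy (possibly infinite) list of URLs, wrapped in UrlSet.
urlsFrom :: state -> (state -> (state, URL)) -> InputUrls
urlsFrom s0 step = UrlSet (go s0)
  where
    go s = let (s', url) = step s in url : go s'

-- Example: what used to be a UrlFunc over a page counter.
pagedUrls :: InputUrls
pagedUrls =
  urlsFrom (1 :: Int)
           (\n -> (n + 1, "http://example.com/search?page=" ++ show n))
```

The state type now appears only in the signature of urlsFrom, so values built with UrlSet or UrlPageDependent need no annotation, and urlSource can simply return InputUrls.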
Upvotes: 2