Reputation: 520
I want to crawl websites by following the links found in each page's HTML.
However, I am concerned about ending up at a variety of "not so child friendly" sites. Does anyone know of a blacklist of sites I could start with to implement my own filters and stay away from (at least some of) the shadier places?
Thanks!
Upvotes: 1
Views: 1295
Reputation: 5751
A very good source of well-maintained blacklists, organized by category, is provided by the University of Toulouse. You can find them here.
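As a rough sketch of how such a blacklist could be plugged into a crawler, the snippet below checks a URL's host (and each of its parent domains) against a set of blocked domains before the URL is queued. It assumes the list is available as plain text with one domain per line; the function names `load_blacklist` and `is_blocked` are made up for illustration and are not part of any library.

```python
from urllib.parse import urlsplit


def load_blacklist(lines):
    """Build a set of lowercase domains from an iterable of text lines.

    Assumption: one domain per line; blank lines and '#' comments are skipped.
    """
    domains = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            domains.add(entry)
    return domains


def is_blocked(url, blacklist):
    """Return True if the URL's host or any parent domain is blacklisted."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # For a.b.example.com, check a.b.example.com, b.example.com, example.com, com.
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))


if __name__ == "__main__":
    blacklist = load_blacklist(["# sample entries", "bad.example"])
    for link in ["http://www.bad.example/page", "http://good.example/"]:
        print(link, "blocked" if is_blocked(link, blacklist) else "ok")
```

In a real crawler you would pass an open file to `load_blacklist` and call `is_blocked` on every extracted link before adding it to the frontier.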
Another approach would be to use a focused crawler and let a classifier decide whether a given page is worth crawling for your specific domain of interest.
Upvotes: 4
Reputation: 4854
A slightly different approach would be to use OpenDNS FamilyShield and configure it as the DNS on the server(s) running your crawler. You could then add a custom filter to your crawler that detects pages filtered by OpenDNS and prevents them from being indexed or stored.
That way you wouldn't have to maintain the blacklists yourself; OpenDNS does that for you.
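One possible shape for that custom filter, as a sketch only: resolve the host through the system resolver (assumed to already point at FamilyShield) and treat the page as filtered if it resolves to a known block-page address. The address in `BLOCK_PAGE_IPS` is a placeholder from a documentation range, not a real OpenDNS address; you would need to replace it with whatever your setup actually returns for a blocked domain. The names `resolve` and `is_filtered` are hypothetical.

```python
import socket
from urllib.parse import urlsplit

# Placeholder only: replace with the address(es) that blocked domains
# resolve to in your FamilyShield setup. 203.0.113.10 is a documentation IP.
BLOCK_PAGE_IPS = {"203.0.113.10"}


def resolve(host):
    """Resolve a host name with the system resolver; return a set of IP strings."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def is_filtered(url, block_ips=BLOCK_PAGE_IPS, resolver=resolve):
    """Return True if the URL's host resolves to a known block-page address."""
    host = urlsplit(url).hostname
    if not host:
        return False
    try:
        addresses = resolver(host)
    except socket.gaierror:
        # Unresolvable hosts are not treated as filtered; the fetch will fail anyway.
        return False
    return bool(addresses & set(block_ips))
```

The `resolver` parameter is there so the check can be exercised without network access; a crawler would call `is_filtered(url)` before fetching or storing a page.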
Upvotes: 0