Reputation: 9
I am trying to download the full archive files of this website (http://www.afghanislamicpress.com/).
I tried using DeepVacuum (http://www.hexcat.com/deepvacuum/index.html) but the site is dynamic (I think that's the right word).
So you submit a form that returns the article archive, but it only shows 5 results at a time (i.e. per page) and then you have to click through. I want to download all the individual articles for the full data set, but I don't want to click through every page manually.
I know there's some easy way to do this, but not entirely sure how.
Any suggestions for a novice at doing data scraping etc?
Upvotes: 0
Views: 139
Reputation: 150148
The most straightforward solution would be to contact the owner of the website, ask for permission to republish their articles, and request a digital copy.
You can certainly automate pulling down paged content, but it requires some programming effort. The best tool for that, IMHO, is the HTML Agility Pack.
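To give a rough idea of what the automation involves, here is a minimal sketch in Python (using requests and BeautifulSoup rather than the .NET HTML Agility Pack) that walks a paginated archive. The form field name ("page"), the link selector, and the output filenames are assumptions, not details of this particular site; inspect the real archive form with your browser's developer tools and adjust them.

```python
# Minimal paging sketch. The "page" form field and the broad "a[href]"
# selector are placeholders -- replace them with the site's real form
# fields and the selector that matches its article links.
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.afghanislamicpress.com/"  # archive endpoint assumed


def fetch_archive_page(session, page_number):
    """Submit the archive form for one page and return the parsed HTML."""
    response = session.post(
        BASE_URL,
        data={"page": page_number},  # hypothetical form field name
        timeout=30,
    )
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


def scrape_all_articles(max_pages=1000):
    """Walk the paginated archive and save each article's HTML to disk."""
    session = requests.Session()
    for page in range(1, max_pages + 1):
        soup = fetch_archive_page(session, page)
        links = [a["href"] for a in soup.select("a[href]")]  # narrow this selector
        if not links:
            break  # no more results: assume we ran past the last page
        for i, link in enumerate(links):
            url = requests.compat.urljoin(BASE_URL, link)
            article = session.get(url, timeout=30)
            with open(f"article_p{page}_{i}.html", "w", encoding="utf-8") as f:
                f.write(article.text)
        time.sleep(1)  # be polite to the server


if __name__ == "__main__":
    scrape_all_articles()
```

The same loop structure (submit the form for page N, collect the article links, stop when a page comes back empty) carries over directly to HTML Agility Pack or any other scraping library.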
Please be sure to comply with the copyright and licensing terms of the content you are downloading.
Upvotes: 1