Reputation: 63
I am looking to use Python to scrape some data from my university's intranet and download all the research papers. I have looked at Python scraping before, but haven't really done any myself. I'm sure I read about a Python scraping framework somewhere; should I use that?
So in essence this is what I need to scrape:
I will then put all this in either XML or a database (most probably XML), and develop an interface etc. at a later date.
Is this doable? Any ideas on where I should start?
Thanks in advance, LukeJenx
EDIT: The framework is Scrapy
EDIT: Turns out that I nearly killed the server today, so a lecturer is getting me copies from the Network team instead... Thanks!
Upvotes: 2
Views: 2593
Reputation: 10526
Scrapy is a great framework, and has really good documentation as well. You should start there.
If you don't know XPath, I'd recommend you learn it if you plan to use Scrapy (it's extremely easy!). XPath expressions help you "locate" the specific elements inside the HTML that you want to extract.
Scrapy already has a built-in command-line option to export to XML, CSV, etc., e.g. scrapy crawl <spidername> -o <filename> -t xml
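For example, a minimal spider might look something like this sketch. The start URL, the CSS class, and the field names are all placeholders for whatever your intranet pages actually contain:

```python
import scrapy

class PapersSpider(scrapy.Spider):
    # Hypothetical spider: the URL and XPaths below are placeholders,
    # not the actual structure of your intranet.
    name = "papers"
    start_urls = ["http://intranet.example.edu/research/papers"]

    def parse(self, response):
        # Each XPath locates one piece of data inside a paper entry.
        for paper in response.xpath("//div[@class='paper']"):
            yield {
                "title": paper.xpath(".//h2/text()").get(),
                "author": paper.xpath(".//span[@class='author']/text()").get(),
                "pdf_url": response.urljoin(paper.xpath(".//a/@href").get()),
            }
```

Running it with the export command above (e.g. scrapy crawl papers -o papers.xml) would then write all the yielded items to a single XML file.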
Mechanize is another great option for writing scrapers easily.
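If the intranet sits behind a login form, a Mechanize session might look roughly like this. The URLs and form field names here are guesses; check the actual login page's form:

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # intranet pages rarely have a robots.txt

# Hypothetical login flow: adjust the URL and field names to the real form.
br.open("http://intranet.example.edu/login")
br.select_form(nr=0)  # select the first form on the page
br["username"] = "your_username"
br["password"] = "your_password"
br.submit()

# Once logged in, the browser keeps the session cookies for later requests.
html = br.open("http://intranet.example.edu/research/papers").read()
```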
Upvotes: 2
Reputation: 4185
Yes, this is very doable, although it depends a lot on the pages. As implied in the comments, a JavaScript-heavy site could make this very difficult.
That aside, for downloading, use the standard urllib2, or look at Requests for a lighter, less painful experience.
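Downloading a PDF with Requests is only a few lines. This assumes a direct, hypothetical URL; swap in the links your scraper finds:

```python
import requests

# Hypothetical URL: replace with a real link from the intranet.
url = "http://intranet.example.edu/papers/example.pdf"
response = requests.get(url)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page

with open("example.pdf", "wb") as f:
    f.write(response.content)
```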
However, it's best not to use regexes to parse HTML; that way lies a world of endless screaming. Seriously though, try BeautifulSoup instead: it's powerful and quite high-level.
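For instance, pulling every PDF link out of a listing page with BeautifulSoup could look like this (the .pdf filter is an assumption about how the links are named):

```python
from bs4 import BeautifulSoup

# `html` is the page source you downloaded above.
soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all("a", href=True):
    if link["href"].endswith(".pdf"):
        print(link["href"], link.get_text(strip=True))
```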
For storage, use whichever is easiest (to me XML seems overkill; consider the json library perhaps).
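Dumping the scraped records as JSON is a one-liner with the standard library. The papers list here is stand-in data for whatever your scraper collects:

```python
import json

# Stand-in data: in practice this is the list of dicts your scraper yields.
papers = [{"title": "Example Paper", "author": "A. Author", "pdf_url": "..."}]

with open("papers.json", "w") as f:
    json.dump(papers, f, indent=2)
```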
Upvotes: 1