LukeJenx

Reputation: 63

Python web scraping - Download a file and store all data in xml

I am looking to use Python to scrape some data from my university's intranet and download all the research papers. I have looked at Python scraping before, but haven't really done any myself. I'm sure I read about a Python scraping framework somewhere; should I use that?

So in essence this is what I need to scrape:

  1. Authors
  2. Description
  3. Field
  4. Then download the file and rename with the paper name.

I will then put all of this in either XML or a database, most probably XML, and develop an interface etc. at a later date.

Is this doable? Any ideas on where I should start?

Thanks in advance, LukeJenx

EDIT: The framework is Scrapy

EDIT: Turns out that I nearly killed the server today, so a lecturer is getting me the copies from the Network team... Thanks!

Upvotes: 2

Views: 2593

Answers (2)

Anuj Gupta

Reputation: 10526

Scrapy is a great framework, and has really good documentation as well. You should start there.

If you don't know XPaths, I'd recommend you learn them if you plan to use Scrapy (they're extremely easy!). XPaths help you "locate" the specific elements inside the HTML that you want to extract.
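To give a feel for what an XPath looks like, here is a minimal sketch using only the standard library's xml.etree.ElementTree, which supports a limited subset of XPath and needs well-formed markup. The page snippet, the class names and the link are all made up for illustration; a real intranet page will have its own structure, and Scrapy's own selectors are more forgiving of messy HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for one paper's page.
PAGE = """
<html>
  <body>
    <div class="paper">
      <h2 class="title">Graph Search Basics</h2>
      <span class="author">A. Smith</span>
      <span class="author">B. Jones</span>
      <p class="description">An overview of graph search.</p>
      <a class="download" href="/files/graph-search.pdf">PDF</a>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# ".//" searches all descendants; "[@class='...']" filters on an attribute value.
title = root.find(".//h2[@class='title']").text
authors = [el.text for el in root.findall(".//span[@class='author']")]
description = root.find(".//p[@class='description']").text
link = root.find(".//a[@class='download']").get("href")

print(title, authors, description, link)
```

The same path expressions carry over to Scrapy more or less unchanged, which is why they are worth learning first.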

Scrapy already has a built-in command-line option to export to XML, CSV, etc., i.e. scrapy crawl <spidername> -o <filename> -t xml

Mechanize is another great option for writing scrapers easily.

Upvotes: 2

declension

Reputation: 4185

Yes, this is very doable, although it depends a lot on the pages. As implied in the comments, a JS-heavy site could make this very difficult.

That aside, for downloading use the standard urllib2, or look at Requests for a lighter, less painful experience.
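As a rough sketch of the download-and-rename step, the following uses urllib.request, the Python 3 successor to urllib2. The helper names safe_filename and download_paper, the .pdf extension and the two-second delay are assumptions for illustration, not anything prescribed by the libraries; the delay is there so a bulk download does not hammer the server.

```python
import re
import time
import urllib.request


def safe_filename(title, ext=".pdf"):
    """Turn a paper title into a filesystem-friendly file name."""
    # Collapse every run of characters that are not word characters or dashes.
    name = re.sub(r"[^\w\-]+", "_", title).strip("_")
    return (name or "untitled") + ext


def download_paper(url, title, delay=2.0):
    """Fetch `url` and save it under a name derived from `title`."""
    path = safe_filename(title)
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())
    time.sleep(delay)  # pause between requests so the server is not flooded
    return path


print(safe_filename("Graph Search: Basics & More"))
```

With Requests the body of download_paper would shrink to a requests.get call followed by writing response.content, but the renaming and throttling logic stays the same.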

However, it's best not to use regexes to parse HTML; it might cause a world of endless screaming. Seriously though, try BeautifulSoup instead - it's powerful and quite high-level.

For storage, whichever's easiest (to me XML seems overkill, consider the json library perhaps).
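To make the comparison concrete, here is a small sketch that stores the same record both ways using only the standard library. The record layout (title, authors, description, field, file) mirrors the fields in the question, but the exact keys, the sample values and the to_xml helper are assumptions made up for this example.

```python
import json
import xml.etree.ElementTree as ET

# One made-up record with the fields the question wants to keep.
papers = [
    {
        "title": "Graph Search Basics",
        "authors": ["A. Smith", "B. Jones"],
        "description": "An overview of graph search.",
        "field": "Computer Science",
        "file": "Graph_Search_Basics.pdf",
    },
]

# JSON: the list of dicts serialises directly.
as_json = json.dumps(papers, indent=2)


def to_xml(records):
    """Build a <papers> document; list values become nested <item> elements."""
    root = ET.Element("papers")
    for rec in records:
        paper = ET.SubElement(root, "paper")
        for key, value in rec.items():
            if isinstance(value, list):
                parent = ET.SubElement(paper, key)
                for item in value:
                    ET.SubElement(parent, "item").text = item
            else:
                ET.SubElement(paper, key).text = value
    return ET.tostring(root, encoding="unicode")


as_xml = to_xml(papers)
print(as_json)
print(as_xml)
```

The JSON version is a single call and loads straight back into Python structures with json.loads, whereas the XML version needs a decision about how to represent lists, which is roughly what I mean by overkill.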

Upvotes: 1
