Reputation: 245
Google's finance API is incomplete -- many of the figures on a page such as:
http://www.google.com/finance?fstype=ii&q=NYSE:GE
are not available via the API.
I need this data to rank companies listed on Canadian stock exchanges using Greenblatt's formula, which you can find by searching Google for "greenblatt index scans".
My question: what is the most intelligent, clean, and efficient way to access and process the data on these web pages? Is the tedious approach of scraping the pages really necessary in this case, and if so, what is the best way of going about it? I'm currently learning Python for projects related to this one.
Upvotes: 5
Views: 4756
Reputation: 3565
You could try asking Google to provide the missing APIs. Otherwise, you're stuck with screen scraping, which is never fun, prone to breaking without notice, and likely in violation of Google's terms of service.
But, if you still want to write a screen scraper, it's hard to beat a combination of mechanize and BeautifulSoup. BeautifulSoup is an HTML parser and mechanize is a Python-based web browser that will let you log in, store cookies, and generally navigate around like any other web browser.
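To give a rough idea of the parsing half, here is a minimal sketch using BeautifulSoup (the `bs4` package). The HTML snippet, the table id `fs-table`, and the helper names `parse_number` and `extract_table` are all made up for illustration; the real page's markup will differ and has to be inspected first. Fetching the page (with mechanize or anything else) is left out, so the sketch works on a string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a financial-statement table.
# The real page's structure is an assumption and must be checked by hand.
SAMPLE_HTML = """
<table id="fs-table">
  <tr><th>In Millions of USD</th><th>2009</th><th>2008</th></tr>
  <tr><td>Revenue</td><td>1,200.50</td><td>1,100.00</td></tr>
  <tr><td>Net Income</td><td>(35.25)</td><td>-</td></tr>
</table>
"""


def parse_number(text):
    """Convert a cell such as '1,200.50', '(35.25)' or '-' to a float or None."""
    text = text.strip().replace(",", "")
    if text in ("", "-"):
        return None  # no figure reported
    if text.startswith("(") and text.endswith(")"):
        return -float(text[1:-1])  # accounting notation for a negative value
    return float(text)


def extract_table(html, table_id):
    """Return {row label: [values...]} for the table with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        raise ValueError("table %r not found; the page layout may have changed" % table_id)
    rows = {}
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue  # skip the header row, which only has <th> cells
        label = cells[0].get_text(strip=True)
        rows[label] = [parse_number(td.get_text()) for td in cells[1:]]
    return rows


figures = extract_table(SAMPLE_HTML, "fs-table")
print(figures)
```

Raising an error when the table is missing is deliberate: a scraper that fails loudly is easier to live with than one that silently returns nothing after the page changes.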
Upvotes: 4
Reputation: 5620
BeautifulSoup would be the preferred method of HTML parsing with Python.
Have you looked into options besides Google (e.g. Yahoo Finance API)?
Upvotes: 3
Reputation: 50642
Scraping web pages always sucks, but I would recommend converting them to XML (via tidy or some other HTML -> XML program) and then using XPath to walk the nodes that you are interested in.
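As a minimal sketch of the XPath step, assume the page has already been converted to well-formed XML. The sample document, the table id `fs-table`, and the function name `labelled_values` are hypothetical. The standard library's ElementTree is used here, which supports only a subset of XPath; a library such as lxml would be needed for full XPath. Real tidy output may also carry an XHTML namespace, which this sample omits.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied document (no XHTML namespace, for simplicity).
SAMPLE_XML = """<html><body>
<table id="fs-table">
<tr><th>Item</th><th>2009</th></tr>
<tr><td>Revenue</td><td>1,200.50</td></tr>
<tr><td>Total Assets</td><td>9,800.00</td></tr>
</table>
</body></html>"""


def labelled_values(xml_text, table_id):
    """Map each row label to the number in the first data column."""
    root = ET.fromstring(xml_text)
    result = {}
    # ElementTree's XPath subset: descend to the table with this id, then its rows.
    for row in root.findall(".//table[@id='%s']/tr" % table_id):
        cells = row.findall("td")
        if len(cells) >= 2:  # header rows contain <th>, not <td>
            label = cells[0].text.strip()
            result[label] = float(cells[1].text.replace(",", ""))
    return result


values = labelled_values(SAMPLE_XML, "fs-table")
print(values)
```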
Upvotes: 0