Reputation: 571
Given the URL http://www.smartmoney.com/quote/FAST/?story=financials&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView , how would you capture and print the contents of an entire row of data?
For example, what would it take to get an output that looked something like: "Cash & Short Term Investments 144,841 169,760 189,252 86,743 57,379"? Or something like "Property, Plant & Equipment - Gross 725,104 632,332 571,467 538,805 465,493"?
I've been introduced to the basics of XPath through sites like http://www.techchorus.net/web-scraping-lxml . However, XPath syntax is still largely a mystery to me.
I have already done this successfully in BeautifulSoup. I like the fact that BeautifulSoup doesn't require me to know the structure of the file - it just looks for the element containing the text I search for. Unfortunately, BeautifulSoup is too slow for a script that has to do this THOUSANDS of times. The source code for my task in BeautifulSoup is (with title_input equal to "Cash & Short Term Investments"):
import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; for bs4: from bs4 import BeautifulSoup
page = urllib2.urlopen(url_local)
soup = BeautifulSoup(page)
# Walk from the matching text node up to its enclosing table row
soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
list_output = soup_line_item.findAll('td')  # list of <td> cells in the row
So what would the equivalent code in lxml be?
EDIT 1: The URLs were concealed the first time I posted. I have now fixed that.
EDIT 2: I have added my BeautifulSoup-based solution to clarify what I'm trying to do.
EDIT 3: +10 to root for your solution. For the benefit of future developers with the same question, I'm posting here a quick-and-dirty script that worked for me:
#!/usr/bin/env python
import urllib
import lxml.html

# 'balancesheet.html' is a saved local copy of the page; urllib.urlopen
# also accepts a full http:// URL.
url = 'balancesheet.html'
result = urllib.urlopen(url)
html = result.read()

doc = lxml.html.document_fromstring(html)
# Find the <th> whose <div> holds the row title, then take the text of
# the sibling <td> cells in that row.
x = doc.xpath(u'.//th[div[text()="Cash & Short Term Investments"]]/following-sibling::td/text()')
print x
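And a rough sketch of how the same XPath could be reused against the live URL for several row titles in one pass (the title list and loop below are illustrative additions, not part of the working script above):

#!/usr/bin/env python
import urllib2
import lxml.html

# Fetch the page once and reuse the parsed tree for every row lookup.
url = ('http://www.smartmoney.com/quote/FAST/?story=financials'
       '&timewindow=1&opt=YB&isFinprint=1&framework.view=smi_emptyView')
html = urllib2.urlopen(url).read()
doc = lxml.html.document_fromstring(html)

titles = [u'Cash & Short Term Investments',
          u'Property, Plant & Equipment - Gross']
for title in titles:
    exp = u'.//th[div[text()="%s"]]/following-sibling::td/text()' % title
    print title, [v.strip() for v in doc.xpath(exp)]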
Upvotes: 0
Views: 2661
Reputation: 443
Assuming html holds the HTML source code:
import lxml.html

doc = lxml.html.document_fromstring(html)
# Absolute path to the balance-sheet table rows; adjust if the page layout changes.
rows_element = doc.xpath('/html/body/div/div[2]/div/div[5]/div/div/table/tbody/tr')
for row in rows_element:
    print row.text_content()
not tested but should work
P.S. Install XPath Checker or FireFinder in Firefox to help you with XPath.
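If you need the row label and the yearly figures separately rather than one concatenated string, a small variation could look like the sketch below (it assumes the label sits in a th/div and the figures in the td cells of each row, as in the other answer on this page):

import lxml.html

doc = lxml.html.document_fromstring(html)
for row in doc.xpath('.//tbody/tr'):
    labels = row.xpath('./th/div/text()')
    if labels:  # skip header/spacer rows that have no label
        values = [v.strip() for v in row.xpath('./td/text()')]
        print labels[0], values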
Upvotes: 0
Reputation: 80396
In [18]: doc.xpath(u'.//th[div[text()="Cash & Short Term Investments"]]/following-sibling::td/text()')
Out[18]: [' 144,841', ' 169,760', ' 189,252', ' 86,743', ' 57,379']
or you can define a little function to get the rows by text:
In [19]: def func(doc,txt):
...: exp=u'.//th[div[text()="{0}"]]'\
...: u'/following-sibling::td/text()'.format(txt)
...: return [i.strip() for i in doc.xpath(exp)]
In [20]: func(doc,u'Total Accounts Receivable')
Out[20]: ['338,594', '270,133', '214,169', '244,940', '236,331']
or you can collect all the rows into a dict:
In [21]: d={}
In [22]: for i in doc.xpath(u'.//tbody/tr'):
...: if len(i.xpath(u'.//th/div/text()')):
...: d[i.xpath(u'.//th/div/text()')[0]]=\
...: [e.strip() for e in i.xpath(u'.//td/text()')]
In [23]: d.items()[:3]
Out[23]:
[('Accounts Receivables, Gross',
['344,241', '274,894', '218,255', '247,600', '238,596']),
('Short-Term Investments',
['27,165', '26,067', '24,400', '851', '159']),
('Cash & Short Term Investments',
['144,841', '169,760', '189,252', '86,743', '57,379'])]
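As a follow-up, one way to package that dict-building loop into a reusable lookup (my own wrapper, not part of the answer above) might be:

def rows_as_dict(doc):
    # Parse every data row once; later lookups are plain dict access.
    d = {}
    for row in doc.xpath(u'.//tbody/tr'):
        labels = row.xpath(u'.//th/div/text()')
        if labels:
            d[labels[0]] = [v.strip() for v in row.xpath(u'.//td/text()')]
    return d

table = rows_as_dict(doc)
print table.get(u'Cash & Short Term Investments')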
Upvotes: 1