p_sutherland

Reputation: 491

How to extract text from lxml.etree tags based on value of sibling tags

My objective is to pull URLs from an XML document (linked) and put them in a list: https://www.valuespreadsheet.com/iedgar/results.php?stock=NFLX&output=xml

I imported etree from lxml and created a list comprehension that pulls the text from all <instanceUrl> tags.

from urllib.request import urlopen  # assuming Python 3
from lxml import etree

url = 'https://valuespreadsheet.com/iedgar/results.php?stock=NFLX&output=xml'
et = etree.fromstring(urlopen(url).read())
urls = [_.find('instanceUrl').text for _ in et.find('filings')]

Now I want to restrict the list so that it only includes the text of <instanceUrl> tags whose sibling <formType> tag has the value 10-K.

How can I achieve this?

Upvotes: 1

Views: 123

Answers (1)

alecxe

Reputation: 474171

You need an XPath expression and the xpath() method:

[url.text for url in et.xpath('//filing[formType = "10-K"]/instanceUrl')]

Here the predicate [formType = "10-K"] keeps only the filing nodes that have a formType child node whose text is 10-K; the final step then selects the instanceUrl child of each matching filing.
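To see the filter in isolation, here is a minimal sketch that runs the same expression against an inline sample instead of the live feed. The sample XML is hypothetical: the root element name `results` and the example.com URLs are assumptions, and only the filings/filing/formType/instanceUrl nesting is taken from the question.

```python
from lxml import etree

# Hypothetical sample mimicking the assumed layout of the feed:
# a <filings> element holding <filing> entries, each with
# <formType> and <instanceUrl> children.
SAMPLE = b"""<results>
  <filings>
    <filing>
      <formType>10-K</formType>
      <instanceUrl>https://example.com/a.xml</instanceUrl>
    </filing>
    <filing>
      <formType>10-Q</formType>
      <instanceUrl>https://example.com/b.xml</instanceUrl>
    </filing>
    <filing>
      <formType>10-K</formType>
      <instanceUrl>https://example.com/c.xml</instanceUrl>
    </filing>
  </filings>
</results>"""

et = etree.fromstring(SAMPLE)

# The predicate keeps only <filing> elements whose <formType> text is "10-K";
# the trailing step returns the <instanceUrl> child of each one.
annual_urls = [url.text for url in et.xpath('//filing[formType = "10-K"]/instanceUrl')]
print(annual_urls)
# ['https://example.com/a.xml', 'https://example.com/c.xml']
```

The 10-Q entry is skipped because the comparison is an exact string match on the formType text.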

Note that, by convention, the _ variable name is reserved for throw-away variables: variables that have to be defined but are never actually used (e.g. during unpacking). In your case you do use it, so a descriptive name would be clearer.
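As a small illustration of that convention, the sketch below contrasts a genuine throw-away with a loop variable that is actually read. The list of (form type, URL) pairs is made-up data, not the real feed.

```python
# Hypothetical (form type, URL) pairs, used only for illustration.
filings = [
    ("10-K", "https://example.com/a.xml"),
    ("10-Q", "https://example.com/b.xml"),
]

# Throw-away use: the form type has to be unpacked but is never read,
# so "_" is appropriate.
all_urls = [url for _, url in filings]

# Here the form type is read in the condition, so it gets a real name.
annual_urls = [url for form_type, url in filings if form_type == "10-K"]
```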

Upvotes: 2
