Reputation: 22440
How do I loop over repeated items when there is no container or group element wrapping each one? I want to parse the title, date and author from the elements pasted below. The three values I am after for each result do not sit inside any per-item group or container, so I can't find the right way to collect them in a loop.
Here are the elements:
html = '''
<div class="view-content">
<p class="text-large experts-more-h">
<a href="/publications/commentary/we-have-no-idea-universal-preschool-actually-helps-kids">We Have No Idea if Universal Preschool Actually Helps Kids</a>
</p>
<p class="text-sans">
By David J. Armor. Washington Post. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-10-21T09:34:00-04:00">October 21, 2014</span>.
</p>
<p class="text-large experts-more-h">
<a href="/publications/commentary/last-parent-resistance-collective-standardized-tests">At Last, Parent Resistance to Collective Standardized Tests</a>
</p>
<p class="text-sans">
By Nat Hentoff. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-15T09:57:00-05:00">January 15, 2014</span>.
</p>
<p class="text-sans">
By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1999-04-15T00:00:00-04:00">April 15, 1999</span>.
</p>
<p class="text-large experts-more-h">
<a href="/publications/commentary/day-care-parents-versus-professional-advocates-0">Day Care: Parents versus Professional Advocates</a>
</p>
<p class="text-sans">
By Darcy Ann Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1998-06-01T00:00:00-04:00">June 1, 1998</span>.
</p>
</div>
'''
If you run my script, you can see that it scrapes only the first result:
from lxml.html import fromstring
tree = fromstring(html)
post= tree.cssselect(".text-large a")[0].text
date = tree.cssselect(".date-display-single")[0].text
author = tree.cssselect(".text-sans")[0].text.strip()
print(post+'\n', date+'\n', author)
Result:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
If you run this one, you will see that this script is able to parse all the results I'm after:
from lxml.html import fromstring
tree = fromstring(html)
count = tree.cssselect(".text-large a")
for item in range(len(count)):
    post = tree.cssselect(".text-large a")[item].text
    date = tree.cssselect(".date-display-single")[item].text
    author = tree.cssselect(".text-sans")[item].text.strip()
    print(post+'\n', date+'\n', author)
Results:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
At Last, Parent Resistance to Collective Standardized Tests
January 15, 2014
By Nat Hentoff. Cato.org.
Day Care: Parents versus Professional Advocates
April 15, 1999
By Darcy Ann Olsen and Eric Olsen. Cato.org.
However, what I did in my second script is not at all Pythonic, and it will give wrong results if any data is missing. So how do I select a group or container, loop through it and parse each item? Thanks in advance.
Upvotes: 2
Views: 192
Reputation: 52665
If one of the text nodes (post, date, author) is missing, tree.cssselect(selector)[index].text returns None, which you cannot handle as a string. To avoid this you can use:
post= tree.cssselect(".text-large a")[item].text or " "
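Note that the fallback above only covers an element that exists but has no text. If the element itself is absent, indexing the result list raises IndexError before .text is ever read. A minimal sketch of guarding both cases, using a made-up snippet (the markup and variable names here are illustrative assumptions, not from the question):

```python
from lxml.html import fromstring

# Hypothetical snippet: the link exists but is empty, and there is no date element at all.
snippet = '<div><p class="text-large"><a href="/x"></a></p></div>'
tree = fromstring(snippet)

# Case 1: the element exists but has no text, so .text is None.
links = tree.xpath('//p[contains(@class, "text-large")]/a')
post = links[0].text or " "

# Case 2: the element is missing, so the result list is empty.
# Check the list before indexing to avoid an IndexError.
dates = tree.xpath('//span[@class="date-display-single"]')
date = dates[0].text if dates else " "

print(repr(post), repr(date))
```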
You can also try the XPath solution below:
container = tree.cssselect(".text-large")
for item in container:
    post = item.xpath('./a')[0].text or " "
    date = item.xpath('./following-sibling::p/span[@class="date-display-single"]')[0].text or " "
    # Apply the fallback before strip(): .text can be None, and None has no strip()
    author = (item.xpath('./following-sibling::p[@class="text-sans"]')[0].text or "").strip() or " "
    print(post+'\n', date+'\n', author)
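One caveat with following-sibling: it matches every later sibling, so a title with no byline of its own would silently pick up the byline of the next entry. A possible variation, sketched below, accepts only the paragraph that immediately follows the title. The sample markup, the parse_entries helper and the empty-string defaults are assumptions made for illustration, and the sketch uses only xpath so it does not need the cssselect package.

```python
from lxml.html import fromstring

# Shortened, hypothetical sample in the same shape as the question's markup.
# The second title deliberately has no byline paragraph after it.
sample = '''
<div class="view-content">
<p class="text-large experts-more-h"><a href="/a">First title</a></p>
<p class="text-sans">By Author One. Site. <span class="date-display-single">October 21, 2014</span>.</p>
<p class="text-large experts-more-h"><a href="/b">Second title</a></p>
<p class="text-large experts-more-h"><a href="/c">Third title</a></p>
<p class="text-sans">By Author Three. Site. <span class="date-display-single">June 1, 1998</span>.</p>
</div>
'''

def parse_entries(markup):
    """Pair each title with the byline paragraph that immediately follows it."""
    tree = fromstring(markup)
    entries = []
    for title_p in tree.xpath('//p[contains(@class, "text-large")]'):
        post = (title_p.findtext('a') or '').strip()
        # Accept only the very next sibling, and only if it is a byline <p>.
        byline = title_p.xpath(
            './following-sibling::*[1][self::p][contains(@class, "text-sans")]'
        )
        if byline:
            author = (byline[0].text or '').strip()
            date = (byline[0].findtext('span') or '').strip()
        else:
            author = date = ''
        entries.append((post, date, author))
    return entries

for post, date, author in parse_entries(sample):
    print(post, date, author, sep='\n')
```

With this pairing, a title without a byline yields empty date and author fields instead of borrowing the next entry's values. A byline with no title before it (as in the question's markup) is skipped, which may or may not be what you want.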
Upvotes: 1