SIM

Reputation: 22440

Can't scrape particular items from some elements

How can I loop over repeated items when no container or group wraps each one? I want to parse the title text, date and author from the elements pasted below. The three values for each result do not sit inside a common parent of their own, so I can't find the right thing to loop over.

Here are the elements:

html = '''
<div class="view-content">            
  <p class="text-large experts-more-h">   
  <a href="/publications/commentary/we-have-no-idea-universal-preschool-actually-helps-kids">We Have No Idea if Universal Preschool Actually Helps Kids</a>
  </p>
  <p class="text-sans">    
  By David J. Armor. Washington Post. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-10-21T09:34:00-04:00">October 21, 2014</span>.
  </p>        
  <p class="text-large experts-more-h">   
  <a href="/publications/commentary/last-parent-resistance-collective-standardized-tests">At Last, Parent Resistance to Collective Standardized Tests</a>
  </p>
  <p class="text-sans">    
  By Nat Hentoff. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-15T09:57:00-05:00">January 15, 2014</span>.
  </p>  
  <p class="text-sans">    
  By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1999-04-15T00:00:00-04:00">April 15, 1999</span>.
  </p>       
  <p class="text-large experts-more-h">   
  <a href="/publications/commentary/day-care-parents-versus-professional-advocates-0">Day Care: Parents versus Professional Advocates</a>
  </p>
  <p class="text-sans">   
  By Darcy Ann Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1998-06-01T00:00:00-04:00">June 1, 1998</span>.
  </p>  
</div>
'''

If you run my script, you can see that it scrapes only the first result:

from lxml.html import fromstring

tree = fromstring(html)
post= tree.cssselect(".text-large a")[0].text
date = tree.cssselect(".date-display-single")[0].text
author = tree.cssselect(".text-sans")[0].text.strip()
print(post+'\n', date+'\n', author)

Result:

We Have No Idea if Universal Preschool Actually Helps Kids
 October 21, 2014
 By David J. Armor. Washington Post.

If you run this one, you will see that this script is able to parse all the results I'm after:

from lxml.html import fromstring

tree = fromstring(html)
count = tree.cssselect(".text-large a")

for item in range(len(count)):
    post= tree.cssselect(".text-large a")[item].text
    date = tree.cssselect(".date-display-single")[item].text
    author = tree.cssselect(".text-sans")[item].text.strip()
    print(post+'\n', date+'\n', author)

Results:

We Have No Idea if Universal Preschool Actually Helps Kids
 October 21, 2014
 By David J. Armor. Washington Post.
At Last, Parent Resistance to Collective Standardized Tests
 January 15, 2014
 By Nat Hentoff. Cato.org.
Day Care: Parents versus Professional Advocates
 April 15, 1999
 By Darcy Ann Olsen and Eric Olsen. Cato.org.

However, what I did in my second script is not at all Pythonic, and it gives wrong results if any data is missing: in the output above, the third title is paired with the date and author of an entry that has no title. So, how can I select a group or container, loop through it and parse all of the items? Thanks in advance.

Upvotes: 2

Views: 192

Answers (1)

Andersson

Reputation: 52665

If one of the text nodes (post, date, author) is missing, tree.cssselect(selector)[index].text returns None, which you cannot handle as a string. To avoid this, you can fall back to a default string:

post= tree.cssselect(".text-large a")[item].text or " "
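
As a small illustration of that fallback, here is a sketch using a made-up one-line fragment (not taken from the question) in which the paragraph has no text before its span, so .text is None:

```python
from lxml.html import fromstring

# Hypothetical fragment: the <p> has no leading text, only a child <span>.
para = fromstring('<p class="text-sans"><span>October 21, 2014</span></p>')

raw = para.text                 # None, because nothing precedes the <span>
safe = (para.text or "").strip()  # apply the fallback before calling .strip()

print(repr(raw), repr(safe))
```

Applying the fallback before .strip() matters: calling .strip() directly on None raises AttributeError.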

You can also try the XPath solution below:

container = tree.cssselect(".text-large")

for item in container:
    post = item.xpath('./a')[0].text or " "
    date = item.xpath('./following-sibling::p/span[@class="date-display-single"]')[0].text or " "
    author = (item.xpath('./following-sibling::p[@class="text-sans"]')[0].text or " ").strip()
    print(post+'\n', date+'\n', author)
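
If entries can lack a title altogether, as the third author line in the question does, another option is to walk the paragraphs in document order and build the groups yourself. The following is only a sketch under assumptions: the markup is a shortened copy of the question's HTML (href values and extra attributes dropped), and parse_records and the dictionary keys are names invented for this example.

```python
from lxml.html import fromstring

# Shortened copy of the question's markup; the third author line has no title.
html = '''
<div class="view-content">
  <p class="text-large experts-more-h"><a href="#">We Have No Idea if Universal Preschool Actually Helps Kids</a></p>
  <p class="text-sans">By David J. Armor. Washington Post. <span class="date-display-single">October 21, 2014</span>.</p>
  <p class="text-large experts-more-h"><a href="#">At Last, Parent Resistance to Collective Standardized Tests</a></p>
  <p class="text-sans">By Nat Hentoff. Cato.org. <span class="date-display-single">January 15, 2014</span>.</p>
  <p class="text-sans">By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single">April 15, 1999</span>.</p>
  <p class="text-large experts-more-h"><a href="#">Day Care: Parents versus Professional Advocates</a></p>
  <p class="text-sans">By Darcy Ann Olsen. Cato.org. <span class="date-display-single">June 1, 1998</span>.</p>
</div>
'''

def parse_records(markup):
    """Group title and author/date paragraphs by walking them in document order."""
    tree = fromstring(markup)
    records = []
    current = None  # record that has a title but no author/date line yet
    for p in tree.xpath('//div[@class="view-content"]/p'):
        classes = (p.get("class") or "").split()
        if "text-large" in classes:
            # A title paragraph always starts a new record.
            current = {"post": (p.findtext("a") or "").strip(), "author": "", "date": ""}
            records.append(current)
        elif "text-sans" in classes:
            if current is None:
                # Author/date line with no title directly before it.
                current = {"post": "", "author": "", "date": ""}
                records.append(current)
            current["author"] = (p.text or "").strip()
            current["date"] = (p.findtext("span") or "").strip()
            current = None  # this record is complete
    return records

records = parse_records(html)
for record in records:
    print(record["post"], record["date"], record["author"], sep="\n ")
```

With this grouping, the "Day Care" title stays paired with its own June 1, 1998 line, and the title-less entry becomes a record with an empty post instead of shifting the later results.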

Upvotes: 1
