Can't scrape particular items from some elements

Question

What should I do when there is no container or group element to select and loop over in order to parse the required items (which are common to each group)? I want to parse the text, date and author from the elements pasted below. The three values I am after do not belong to any particular group or container, so I can't find the right way to get them in a loop.

Here are the elements:

html = '''
            
     
  We Have No Idea if Universal Preschool Actually Helps Kids
  
      
  By David J. Armor. Washington Post. October 21, 2014.
          
     
  At Last, Parent Resistance to Collective Standardized Tests
  
      
  By Nat Hentoff. Cato.org. January 15, 2014.
    
      
  By Darcy Ann Olsen and Eric Olsen. Cato.org. April 15, 1999.
         
     
  Day Care: Parents versus Professional Advocates
  
     
  By Darcy Ann Olsen. Cato.org. June 1, 1998.
    

'''

If you run my script, you can see that it scrapes only the first result:

from lxml.html import fromstring

tree = fromstring(html)
post = tree.cssselect(".text-large a")[0].text
date = tree.cssselect(".date-display-single")[0].text
author = tree.cssselect(".text-sans")[0].text.strip()
print(post+'\n', date+'\n', author)

Result:

We Have No Idea if Universal Preschool Actually Helps Kids
 October 21, 2014
 By David J. Armor. Washington Post.

If you run this one, you will see that this script is able to parse all the results I'm after:

from lxml.html import fromstring

tree = fromstring(html)
count = tree.cssselect(".text-large a")

for item in range(len(count)):
    post= tree.cssselect(".text-large a")[item].text
    date = tree.cssselect(".date-display-single")[item].text
    author = tree.cssselect(".text-sans")[item].text.strip()
    print(post+'\n', date+'\n', author)

Results:

We Have No Idea if Universal Preschool Actually Helps Kids
 October 21, 2014
 By David J. Armor. Washington Post.
At Last, Parent Resistance to Collective Standardized Tests
 January 15, 2014
 By Nat Hentoff. Cato.org.
Day Care: Parents versus Professional Advocates
 April 15, 1999
 By Darcy Ann Olsen and Eric Olsen. Cato.org.

However, what I did in my second script is not at all Pythonic, and it will give wrong results if any data is missing. So, how can I select a group or container, loop through it, and parse all of the items? Thanks in advance.

Andersson · Accepted Answer

If one of the text nodes (post, date, author) is missing, tree.cssselect(selector)[index].text returns None, which you cannot concatenate with a string. To avoid this, you can fall back to a default value:

post= tree.cssselect(".text-large a")[item].text or " "
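Note that the `or " "` fallback only covers an element whose text is None. If the element itself is absent, indexing the result list raises IndexError before `.text` is ever read. A minimal sketch of a helper that covers both cases is below; `text_at` is a hypothetical name, not part of lxml, and the `SimpleNamespace` objects merely stand in for parsed elements that have a `.text` attribute.

```python
from types import SimpleNamespace


def text_at(elements, index, default=""):
    """Return the stripped text of elements[index], or default when the
    element is absent or its text is None/blank."""
    if not 0 <= index < len(elements):
        return default  # no such element: avoid IndexError
    text = (elements[index].text or "").strip()
    return text or default


# Stand-ins for parsed elements (assumption: only .text is needed here).
nodes = [SimpleNamespace(text="  Some title  "), SimpleNamespace(text=None)]

print(text_at(nodes, 0))          # element present, text present
print(text_at(nodes, 1, "n/a"))   # element present, text is None
print(text_at(nodes, 2, "n/a"))   # element absent
```

With lxml, the list returned by `tree.cssselect(...)` could be passed in place of `nodes`.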

You can also try the XPath solution below:

container = tree.cssselect(".text-large")

for item in container:
    post = item.xpath('./a')[0].text or " "
    date = item.xpath('./following-sibling::p/span[@class="date-display-single"]')[0].text or " "
    author = (item.xpath('./following-sibling::p[@class="text-sans"]')[0].text or "").strip() or " "
    print(post+'\n', date+'\n', author)
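When there really is no wrapping container, another option is to walk the elements in document order and start a new record each time a title appears, so a missing title or byline cannot shift the remaining values out of alignment. The sketch below is only an illustration under assumptions: the markup in `SAMPLE` is hypothetical (the tags in the question's snippet were not preserved, so the structure is inferred from the selectors), `group_records` and `has_class` are made-up helper names, and the demo parses with the standard library's ElementTree so it runs without lxml. lxml elements expose the same `iter()`, `get()`, `find()` and `text` members, so the function should carry over, though that is not tested here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, inferred from the selectors used in the question.
SAMPLE = """
<div>
  <p class="text-large"><a href="#">Post A</a></p>
  <p class="text-sans">By Author A. <span class="date-display-single">January 1, 2001</span></p>
  <p class="text-sans">By Author B. <span class="date-display-single">February 2, 2002</span></p>
  <p class="text-large"><a href="#">Post C</a></p>
  <p class="text-large"><a href="#">Post D</a></p>
  <p class="text-sans">By Author D. <span class="date-display-single">April 4, 2004</span></p>
</div>
"""


def has_class(el, name):
    """True if the element's class attribute contains the given class name."""
    return name in (el.get("class") or "").split()


def group_records(root):
    """Walk the tree in document order and build one record per title/byline group."""
    records = []
    current = None      # record currently being filled
    has_byline = False  # whether the current record already got a byline
    for el in root.iter():
        if has_class(el, "text-large"):
            # A title always starts a new record.
            link = el.find("a")
            title = (link.text or "").strip() if link is not None else ""
            current = {"post": title, "date": "", "author": ""}
            records.append(current)
            has_byline = False
        elif has_class(el, "text-sans"):
            # A byline with no title of its own gets a record with an empty post.
            if current is None or has_byline:
                current = {"post": "", "date": "", "author": ""}
                records.append(current)
            current["author"] = (el.text or "").strip()
            for span in el.iter("span"):
                if has_class(span, "date-display-single"):
                    current["date"] = (span.text or "").strip()
                    break
            has_byline = True
    return records


for record in group_records(ET.fromstring(SAMPLE.strip())):
    print(record)
```

Missing pieces come back as empty strings instead of borrowing a value from a neighbouring item, which is the failure mode of indexing three independent lists by position.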
