TijnvdEijnde

Reputation: 191

How to deal with empty fields in scrapy when using keys

I have made a spider in scrapy that can successfully scrape data from a website.

    def parse(self, response):
        for text in response.css('div.row'):
            yield {
                'product': text.css('div.item a.item::text').get(),
                'test1': text.css('div.item span::text')[0].get(),
                'test2': text.css('div.item span::text')[1].get(),
            }

This is not the complete code, but this should be enough to explain the problem.

The problem occurs when a row has no second span, so text.css('div.item span::text')[1] does not exist.

This raises IndexError: list index out of range, which makes sense. But how can I check whether the value is missing so I can replace it with a default?

  1. I know get() has a default parameter, get(default=''), but because I index with [0] first, the IndexError is raised before get() is ever called.
  2. I looked into ternary (conditional) expressions, but I could not find a way to use one inside what I think is a dictionary.

Upvotes: 0

Views: 282

Answers (1)

furas

Reputation: 142631

First store the selection in a variable, items = text.css(...).

Then check len(items) > 0 before you use items[0],
and len(items) > 1 before you use items[1]. A conditional expression works fine as a dictionary value:

    def parse(self, response):
        for text in response.css('div.row'):
            # Select once, then index only when the position exists
            items = text.css('div.item span::text')
            yield {
                'product': text.css('div.item a.item::text').get(),
                'test1': items[0].get() if len(items) > 0 else "",
                'test2': items[1].get() if len(items) > 1 else "",
            }
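If there are more than a couple of positional fields, the repeated length checks could be folded into a small helper. The following is only a sketch: `pick` is a hypothetical name, not part of Scrapy or parsel, and the plain list stands in for what `text.css('div.item span::text').getall()` would return for a row with a single span.

```python
def pick(values, index, default=""):
    """Return values[index], or default when that position does not exist."""
    try:
        return values[index]
    except IndexError:
        return default


# Stand-in for text.css('div.item span::text').getall() on a row
# that contains only one span (assumption for illustration).
spans = ["first"]

row = {
    "test1": pick(spans, 0),
    "test2": pick(spans, 1),  # no second span, so the default is used
}
print(row)  # {'test1': 'first', 'test2': ''}
```

Working on the list of strings from getall() keeps the helper independent of the selector objects, so it can be tried without running a spider.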

EDIT:

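You can also put the position into the selector itself, using CSS :nth-of-type(1) instead of indexing with [0], and then call get() with a default. Note that :nth-of-type is 1-based and counts sibling elements of the same tag, so it matches [0] only when the elements share one parent: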

'div.item a.item:nth-of-type(1)::text'

Or use XPath, where positions are also 1-based, so [1] selects the first match:

'(.//div[@class="item"]/a[@class="item"])[1]/text()'

Scrapy uses the parsel module for its selectors, so here is a minimal working example that uses parsel directly:

import parsel

text = '''
<div class="item">
<a class="item" href="a">a</a>
<a class="item" href="b">b</a>
</div>
'''

s = parsel.Selector(text)

# CSS: a position that does not exist selects nothing, so get() returns the default
print(s.css('div.item a.item:nth-of-type(1)::text').get('empty')) # a
print(s.css('div.item a.item:nth-of-type(2)::text').get('empty')) # b
print(s.css('div.item a.item:nth-of-type(3)::text').get('empty')) # empty

# XPath: same idea, with 1-based positions
print(s.xpath('(.//div[@class="item"]/a[@class="item"])[1]/text()').get('empty')) # a
print(s.xpath('(.//div[@class="item"]/a[@class="item"])[2]/text()').get('empty')) # b
print(s.xpath('(.//div[@class="item"]/a[@class="item"])[3]/text()').get('empty')) # empty

Upvotes: 2
