KJW

Reputation: 15251

Scrapy: Item discrepancy

Scenario: a page with multiple items, each consisting of a title, a description, and an image. What happens when one of the items is missing its title? How does Scrapy handle it? It seems that Scrapy blindly selects all titles with //div[@id='content']/ul/li/div[@id='title']/text().

The expected output is that the affected row simply has a missing title. But I fear that, since the XPath selects all titles on the page without considering the item context, the rows will shift. If the 5th item is missing its title, wouldn't it mistakenly use the 6th item's title instead?

title1 | description | image
.
.
title4 | description | image
title6 | description | image  <--- it's supposed to be missing the title.
       | description | image 

Does Scrapy have a way to deal with this problem?

A workaround I was thinking of would be to select each parent item element first, and then look inside that item. If something is missing, don't show it.

Upvotes: 1

Views: 195

Answers (1)

akhter wahab

Reputation: 4085

There are a variety of ways you can handle this situation:

1) You can implement a pipeline that skips items that are not required.

2) You can add a check in your extraction code so that it only yields/returns an item that is required.
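As a rough illustration of option 1, here is a minimal sketch of such a pipeline. It assumes items behave like dicts with a "title" field and that an item without a title should be discarded; the class name RequireTitlePipeline is made up for this example. The fallback DropItem class exists only so the snippet runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch can run without Scrapy installed.
    class DropItem(Exception):
        pass


class RequireTitlePipeline:
    """Hypothetical pipeline: drops items whose title is missing or blank."""

    def process_item(self, item, spider):
        title = item.get("title")
        if not title or not title.strip():
            # Raising DropItem tells Scrapy to stop processing this item.
            raise DropItem("missing title")
        return item
```

In a real project the class would be enabled through the ITEM_PIPELINES setting. If you would rather keep the row with an empty title, as the question describes, return the item unchanged instead of raising.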

You need to understand that Scrapy is a high-level crawling framework that also provides built-in support for data extraction; you can use any library you like for the extraction itself.
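To make the "look inside each item" idea from the question concrete, here is a small sketch using only the standard library's xml.etree.ElementTree rather than Scrapy's own selectors. The markup in PAGE is invented for the example (its second item has no title), and it has to be well-formed XML for this parser. The point is the shape of the loop: select each li first, then query relative to it, so a missing title stays attached to its own row.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second item has no title element.
PAGE = """
<div id="content">
  <ul>
    <li><div id="title">title1</div><div id="description">first</div></li>
    <li><div id="description">second</div></li>
    <li><div id="title">title3</div><div id="description">third</div></li>
  </ul>
</div>
"""


def extract_rows(markup):
    root = ET.fromstring(markup)  # root is the <div id="content"> element
    rows = []
    # Select each item container first, then look inside that item only.
    for li in root.findall("./ul/li"):
        rows.append({
            # findtext returns None when the child element is absent.
            "title": li.findtext("./div[@id='title']"),
            "description": li.findtext("./div[@id='description']"),
        })
    return rows


rows = extract_rows(PAGE)

# For contrast: a page-wide query loses track of which item each title came from.
all_titles = [d.text for d in ET.fromstring(PAGE).findall(".//div[@id='title']")]
```

Here rows keeps three entries with None as the second title, while all_titles holds only two strings and would misalign if zipped against the descriptions. The same pattern should carry over to Scrapy selectors by iterating over the li nodes and running a relative XPath on each one.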

Upvotes: 2
