Reputation: 1061
As a learning experiment to familiarize myself with Scrapy, I'm writing a scraper that checks all the links on an HTML page and reports the status codes of HTTP HEAD requests sent to them. In one of my item definitions, one field, parent_url, is treated as metadata: I do not want it to appear in my scraper's output. parent_url is defined in the LinkItem class, as shown below:
from scrapy.item import Item, Field

class LinkItem(Item):
    name = Field()
    url = Field()
    parent_url = Field()  # Identifies what URL this item was extracted from
    status_code = Field()
In order to omit parent_url from my spider's output I've tried:
- setting parent_url in __init__ as an instance attribute, but I got exceptions when trying to access it;
- assigning to self["parent_url"] inside __init__, but as the documentation notes, Scrapy doesn't allow assigning to undeclared fields;
- assigning Field(serializer=None) or Field(serializer=empty_function) to parent_url, which raised exceptions continuously while scraping and produced JSON output containing only commas.

Not having found a solution yet, I'm looking for outside help. The parent_url field is used internally within a pipeline, and I don't know what else to replace it with.
Upvotes: 0
Views: 551
Reputation: 1061
BaseItemExporter, the abstract base class of all built-in exporters, provides a fields_to_export attribute holding the list of field names to export. This is straight out of the docs, and I'm surprised I hadn't noticed it before.
Upvotes: 0
Reputation: 3126
You can specify which fields should be exported via the FEED_EXPORT_FIELDS setting. For example:
# in `settings.py`
FEED_EXPORT_FIELDS = ['name', 'url', 'status_code']
Upvotes: 2