Reputation: 1061

How can I instruct Scrapy to not serialize an item field?

As a learning experiment to familiarize myself with Scrapy, I'm writing a scraper that checks all the links on an HTML page and reports the status codes of HTTP HEAD requests sent to them. In one of my item definitions I have a field, namely parent_url, that I treat as metadata - that is, I don't want it to appear in my scraper's output.

parent_url is defined in the LinkItem class, as shown below:

from scrapy import Field, Item

class LinkItem(Item):
    name = Field()
    url = Field()
    parent_url = Field()   # Identifies what URL this item was extracted from
    status_code = Field()

To omit parent_url from my spider's output I've tried:

Defining parent_url in __init__ as an instance attribute - exceptions were raised when I tried to access it;
Assigning to self["parent_url"] inside __init__, but as the documentation already notes, Scrapy doesn't allow assigning to undeclared fields;
Assigning Field(serializer=None) or Field(serializer=empty_function) to parent_url, which raised exceptions throughout the crawl and produced JSON output containing only commas.

Since I haven't found a solution yet, I'm looking for outside help. The parent_url field/attribute is used internally within a pipeline, and I don't know what else I could replace it with.

Upvotes: 0

Answers (2)

Acsor

Reputation: 1061

BaseItemExporter, the abstract base class of all built-in exporters, provides a fields_to_export attribute holding the list of field names to export. This is straight out of the docs, and I'm surprised I didn't notice it before.

Upvotes: 0

Eugene V

Reputation: 3126

You can specify which fields should be exported via the FEED_EXPORT_FIELDS setting. For example:

# in `settings.py`
FEED_EXPORT_FIELDS = ['name', 'url', 'status_code']
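Since the question mentions that parent_url is consumed by a pipeline: the setting only restricts what the feed exporter writes, so pipelines still receive the full item. A rough sketch of such a pipeline follows; the class name ParentUrlTrackingPipeline and its seen_parents attribute are invented for this example, and a plain dict stands in for the item so the snippet runs without a crawl.

```python
class ParentUrlTrackingPipeline:
    """Hypothetical pipeline that uses parent_url internally."""

    def __init__(self):
        # Collects every page that links were extracted from.
        self.seen_parents = set()

    def process_item(self, item, spider):
        # parent_url is still present here; FEED_EXPORT_FIELDS only
        # affects which fields end up in the exported feed.
        self.seen_parents.add(item["parent_url"])
        return item


pipeline = ParentUrlTrackingPipeline()
result = pipeline.process_item(
    {
        "name": "Home",
        "url": "http://example.com/",
        "parent_url": "http://example.com/index.html",
        "status_code": 200,
    },
    spider=None,  # no spider is needed for this standalone illustration
)
print(pipeline.seen_parents)
```

In a real project the pipeline would be enabled through ITEM_PIPELINES in settings.py as usual.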

Upvotes: 2