Marc C

Reputation: 329

Scrapy python json output, clear file before writing

I am currently using Scrapy to gather data and output to a json file with

scrapy crawl foobar -a category=foo -o bar.json

However, this appends to the bar.json file rather than overwriting it. I would like to clear the file and rewrite it on each run. Is this possible with a Scrapy argument at all?

Or would I be required to clear it outside of Scrapy first?

Many thanks.

Upvotes: 0

Views: 2230

Answers (6)

Frederick

Reputation: 470

You can also add the line open(LOG_FILE, "w+").close(), where LOG_FILE is the name of your log file from your settings.py. This opens the file, clears it, and closes it.

Upvotes: 0

Sanyam Khurana

Reputation: 1421

Overwriting feeds was added to Scrapy on Aug 17, 2020 with PR #4512. You can use the -O flag to overwrite, so the final command will look like this:

scrapy crawl foobar -a category=foo -O bar.json

Upvotes: 0

Vitaliy Vikhasty

Reputation: 33

Modify your spider like the following:

from scrapy import Spider


class MySpider(Spider):
    """
    Main crawler
    """
    name = "mucrawler"
    allowed_domains = ["sss.com"]
    start_urls = ["https://www.sdsd/rov/"]

    # Empty the output file when the spider class is defined
    open("bar.json", "w").close()

    def parse(self, response):
        titles = response.css("td.offer")

Upvotes: 1

Vasim

Reputation: 257

You can remove the output file first, then start crawling for new data using:

rm output_file_name.csv; scrapy crawl spider_name -o output_file_name.csv

Upvotes: 2

William Kinaan

Reputation: 28799

In addition to what @GHajba has said, another solution would be to create your own pipeline, which lets you apply whatever actions you want to any file.

For example, you can check whether the file exists, and then either clear it or append data to it.

You can write to different files.

You can also clean up or drop some of your items in the pipeline, since it is not good practice to do that in your spider.
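As an illustration of the pipeline approach (the class name, output file name, and JSON-lines format below are assumptions, not taken from the answer), a minimal sketch might look like this:

```python
import json


class TruncatingJsonPipeline:
    """Sketch of a pipeline that clears the output file on spider start."""

    file_name = "bar.json"  # hypothetical output path

    def open_spider(self, spider):
        # Opening in "w" mode discards any contents from a previous run.
        self.file = open(self.file_name, "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Append each scraped item as one JSON line.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

To activate it, you would register the class under ITEM_PIPELINES in settings.py.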

Upvotes: 0

GHajba

Reputation: 3691

Currently there is no automated solution for this issue, although an open issue about this topic exists on GitHub.

This means you have to remove the file prior to launching your crawl.

One workaround would be to write an item exporter that removes the output file when it is initialized (and exports the items while you are at it).
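A rough sketch of that workaround, stripped of Scrapy's actual exporter base classes (the class name and the JSON-lines output format are assumptions):

```python
import json
import os


class FreshFileExporter:
    """Sketch: delete any stale output file on init, then append items."""

    def __init__(self, path):
        self.path = path
        # Remove the output file left over from a previous run, if any.
        if os.path.exists(self.path):
            os.remove(self.path)

    def export_item(self, item):
        # Append one JSON line per exported item to a fresh file.
        with open(self.path, "a") as f:
            f.write(json.dumps(item) + "\n")
```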

Upvotes: 0
