hooliooo

Reputation: 548

Numbering Items in Scrapy

So I have an items.py with the following:

class ScrapyItem(scrapy.Item):
    source = scrapy.Field()
    link = scrapy.Field()

and the JSON output is:

[{"source": "Some source", "link":"www.somelink.com"},
 {"source": "Some source again", "link":"www.somelink.org"}]

Is there a way to change the output to:

[{"source1": "Some source", "link1":"www.somelink.com"},
 {"source2": "Some source again", "link2":"www.somelink.org"}]

From the docs, I saw that you can manipulate the item values. Can you do the same to the items themselves?

EDIT

Here's the new code I'm using for the output, with an article_id item field:

article_id = [1]
def parse_common(self, response):
    feed = feedparser.parse(response.body)
    for entry_n, entry in enumerate(feed.entries, start=article_id[-1]):
        try:
            item = NewsbyteItem()
            item['source'] = response.url
            item['title'] = lxml.html.fromstring(entry.title).text
            item['link'] = entry.link
            item['description'] = entry.description
            item['article_id'] = '%d' % entry_n
            article_id.append(entry_n)
            request = Request(
                entry.link,
                callback=getattr(self, response.meta['method']),
                dont_filter=response.meta.get('dont_filter', False)
            )

            request.meta['item'] = item
            request.meta['entry'] = entry

            yield request
        except Exception as e:
            print('%s: %s' % (type(e), e))
            print(entry)

The problem is that entry_n restarts whenever the spider moves on to another URL. That's why the article_id list is used: it carries the last number over to the next feed.

Upvotes: 2

Views: 904

Answers (2)

miraculixx

Reputation: 10349

From the discussion

The purpose of the identifier is if an item has some data missing or includes data that isn't needed, I can find that dictionary right away and refactor the code accordingly.

With that purpose in mind, I'd suggest generating UUIDs. Same effect, less hassle:

import uuid

import scrapy

# item definition
class ScrapyItem(scrapy.Item):
    source = scrapy.Field()
    link = scrapy.Field()
    uuid = scrapy.Field()

# processing
def parse_common(self, response):
    ...
    # store the UUID as a string so it serializes cleanly to JSON
    item['uuid'] = str(uuid.uuid4())
    ...
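
A plain-Python sketch of the same idea, using dicts in place of Scrapy items (an assumption, so that it runs without Scrapy). tag_with_uuid is a hypothetical helper name. The UUID is converted to a string because the standard json module cannot serialize uuid.UUID objects directly:

```python
import json
import uuid


def tag_with_uuid(item):
    """Return a copy of item with a random UUID stored as a string."""
    tagged = dict(item)
    # uuid4() is random, so ids are unique without any shared counter
    tagged['uuid'] = str(uuid.uuid4())
    return tagged


items = [
    {"source": "Some source", "link": "www.somelink.com"},
    {"source": "Some source again", "link": "www.somelink.org"},
]
tagged_items = [tag_with_uuid(item) for item in items]

# Each dict now carries its own identifier and still dumps to JSON.
print(json.dumps(tagged_items, indent=1))
```

Because the ids are random rather than sequential, they identify an item but say nothing about the order in which items were scraped.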

Upvotes: 0

arodriguezdonaire

Reputation: 5562

I don't recommend identifying items by changing the keys of their values. Instead, you can build a dictionary that names each item, something like:

output = [{"source": "Some source", "link":"www.somelink.com"}, {"source": "Some source again", "link":"www.somelink.org"}]
output_dict = {}
for counter, item in enumerate(output):
    output_dict['item' + str(counter + 1)] = item
print(output_dict)
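
For the numbering problem in the question's edit (the count restarting with each new URL), one option is a single counter shared by every callback. A minimal sketch, assuming plain dicts stand in for the feed entries and items; ArticleNumberer is a hypothetical helper, not part of Scrapy:

```python
import itertools


class ArticleNumberer(object):
    """Hands out consecutive ids from one counter shared by all callers."""

    def __init__(self, start=1):
        # itertools.count keeps its position between calls to number()
        self._counter = itertools.count(start)

    def number(self, entries):
        """Yield a copy of each entry with an 'article_id' string added."""
        for entry in entries:
            numbered = dict(entry)
            numbered['article_id'] = str(next(self._counter))
            yield numbered


numberer = ArticleNumberer()
feed_a = [{"link": "www.somelink.com"}, {"link": "www.somelink.org"}]
feed_b = [{"link": "www.somelink.net"}]

first = list(numberer.number(feed_a))
second = list(numberer.number(feed_b))  # continues where feed_a stopped

print([entry['article_id'] for entry in first + second])
```

In a spider, the counter could live on the spider instance so that every call to parse_common draws from it. The ids would then follow the order in which responses arrive, which is not necessarily the order of the start URLs.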

Upvotes: 3
