chesneybrown

Reputation: 55

TypeError when putting scraped data from scrapy into elasticsearch

I've been following this tutorial (http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html) and using this Scrapy Elasticsearch pipeline (https://github.com/knockrentals/scrapy-elasticsearch). I can extract data from Scrapy to a JSON file, and I have an Elasticsearch server up and running on localhost.

However, when I try to send the scraped data to Elasticsearch through the pipeline, I get the following error:

2015-08-05 21:21:53 [scrapy] ERROR: Error processing {'link': [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/'],
 'title': [u'Alles rund um Elasticsearch']}
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 70, in process_item
    self.index_item(item)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapyelasticsearch/scrapyelasticsearch.py", line 52, in index_item
    local_id = hashlib.sha1(item[uniq_key]).hexdigest()
TypeError: must be string or buffer, not list

My items.py Scrapy file looks like this:

from scrapy.item import Item, Field

class MeetupItem(Item):
    title = Field()
    link = Field()
    description = Field()

The relevant part (I think) of my settings.py file looks like this:

from scrapy import log

ITEM_PIPELINES = [
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline',
]

ELASTICSEARCH_SERVER = 'localhost' # If not 'localhost' prepend 'http://'
ELASTICSEARCH_PORT = 9200 # If port 80 leave blank
ELASTICSEARCH_USERNAME = ''
ELASTICSEARCH_PASSWORD = ''
ELASTICSEARCH_INDEX = 'meetups'
ELASTICSEARCH_TYPE = 'meetup'
ELASTICSEARCH_UNIQ_KEY = 'link'
ELASTICSEARCH_LOG_LEVEL = log.DEBUG

Any help would be greatly appreciated!

Upvotes: 3

Views: 817

Answers (1)

GHajba

Reputation: 3691

As the error message shows (Error processing {'link': [u'http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/'], 'title': [u'Alles rund um Elasticsearch']}), your item's link and title fields are lists; the square brackets around the values indicate this. The pipeline passes item['link'] (your ELASTICSEARCH_UNIQ_KEY) to hashlib.sha1, which expects a string, hence the TypeError.
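
The failure can be reproduced outside Scrapy. The following is a minimal sketch, assuming Python 3 (where the value must also be encoded to bytes before hashing, unlike the Python 2.7 setup in the traceback); the local_id helper is a hypothetical stand-in for the hashing line in the pipeline, not the pipeline's own code:

```python
import hashlib

# Item shaped like the one in the error message: the value is a list.
item = {"link": ["http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/"]}


def local_id(value):
    """Hash a single string, mirroring hashlib.sha1(item[uniq_key]).hexdigest()."""
    # Python 3 requires bytes, so encode first (Python 2 accepted str directly).
    return hashlib.sha1(value.encode("utf-8")).hexdigest()


try:
    hashlib.sha1(item["link"])  # a list, as in the failing item
    failed = False
except TypeError:
    failed = True

# Taking the first element gives a plain string, which hashes without error.
fixed_id = local_id(item["link"][0])
```

Passing the list raises a TypeError, while passing its first element produces a normal SHA-1 hex digest.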

This comes from how you extract the data in Scrapy. You did not post your spider with the question, but extract() returns a list, so you should use response.xpath().extract()[0] to get the first result. In that case, be prepared for empty result sets, where [0] raises an IndexError.

Update

To handle the case where nothing is extracted, you could use the following:

linkSelection = response.xpath().extract()
item['link'] = linkSelection[0] if linkSelection else ""

Or something similar, depending on your data and fields. None could also be a valid default if the list is empty.

The basic idea is to separate the XPath extraction from the selection of a list element, and to select an element only if the list is not empty.
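
If several fields need the same treatment, the check can be factored into a small helper. This is only a sketch: first_or_default is a hypothetical name, and the plain lists below stand in for what response.xpath(...).extract() would return in a real spider:

```python
def first_or_default(values, default=""):
    """Return the first element of values, or default when values is empty."""
    return values[0] if values else default


# Stand-ins for extract() results: one match, and no match at all.
matched = ["http://www.meetup.com/Search-Meetup-Karlsruhe/events/221907250/"]
unmatched = []

item = {
    "link": first_or_default(matched),
    "title": first_or_default(unmatched, default=None),
}
```

Each field then holds a single value instead of a list, so the pipeline can hash the unique key. Newer Scrapy releases also offer extract_first() on selector lists, which does the same job if your version has it.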

Upvotes: 2
