Reputation: 25
I am crawling websites using Scrapy and then sending that data to Solr to be indexed. The data is sent through an item pipeline that uses one of Solr's Python clients, mysolr.
The spider works correctly and my items array has two items with the correct fields. The items are handled by the process_item function in the pipeline.
Item Pipeline
from mysolr import Solr
class SolrPipeline(object):
    def __init__(self):
        self.client = Solr('http://localhost:8983/solr', version=4)
        response = self.client.search(q='Title')
        print response
    def process_item(self, item, spider):
        docs = [
            {'title': item["title"],
             'subtitle': item["subtitle"]
            },
            {'title': item["title"],
             'subtitle': item["subtitle"]
            }
        ]
        print docs
        self.client.update(docs, 'json', commit=False)
        self.client.commit()
This is where I get my problem. The response that gets printed is < SolrResponse status=404 >. I used the SOLR_URL that appears whenever I launch the Admin UI of Solr.
Another error I get is below.
2015-08-25 09:06:53 [urllib3.connectionpool] INFO: Starting new HTTP connection (1): localhost
2015-08-25 09:06:53 [urllib3.connectionpool] DEBUG: Setting read timeout to None
2015-08-25 09:06:53 [urllib3.connectionpool] DEBUG: "POST /update/json HTTP/1.1" 404 1278
2015-08-25 09:06:53 [urllib3.connectionpool] INFO: Starting new HTTP connection (1): localhost
2015-08-25 09:06:53 [urllib3.connectionpool] DEBUG: Setting read timeout to None
2015-08-25 09:06:53 [urllib3.connectionpool] DEBUG: "POST /update HTTP/1.1" 404 1273
These six lines appear twice (once for each item I am trying to add, I presume).
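A note on the request paths in this log: the POSTs go to /update/json and /update, without the /solr prefix from the configured URL. One possible explanation, offered as an assumption and not checked against mysolr's source, is that the client combines the base URL and the endpoint using standard URL-joining rules, which replace the last path segment when the base has no trailing slash. The sketch below reproduces that behaviour with the standard library only; build_update_url is a hypothetical helper, not part of mysolr.

```python
# Python 3 import; on Python 2 the same function lives in the `urlparse` module.
from urllib.parse import urljoin


def build_update_url(base_url, path="update/json"):
    """Hypothetical helper: resolve a relative endpoint against a base URL."""
    return urljoin(base_url, path)


# Base URL exactly as written in the pipeline (no trailing slash).
without_slash = build_update_url("http://localhost:8983/solr")

# Same base URL with a trailing slash added.
with_slash = build_update_url("http://localhost:8983/solr/")

print(without_slash)  # http://localhost:8983/update/json
print(with_slash)     # http://localhost:8983/solr/update/json
```

If this is what happens inside the client, a base URL ending in a slash would keep the /solr prefix. Whether a core or collection name is also needed in the URL depends on the Solr setup.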
Upvotes: 1
Views: 492
Reputation: 9849
You want to send a POST request with JSON data, but you are in fact passing a Python list of dictionaries to the self.client.update() method.
Convert the Python list of dictionaries to JSON:
import json
from mysolr import Solr
class SolrPipeline(object):
    def __init__(self):
        self.client = Solr('http://localhost:8983/solr', version=4)
        response = self.client.search(q='Title')
        print response
    def process_item(self, item, spider):
        docs = [
            {'title': item["title"],
             'subtitle': item["subtitle"]
            },
            {'title': item["title"],
             'subtitle': item["subtitle"]
            }
        ]
        docs = json.dumps(docs)  # convert to JSON
        self.client.update(docs, 'json', commit=False)
        self.client.commit()
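To see what the conversion step produces on its own, here is a minimal sketch that runs without Scrapy or Solr. The item dictionary and its values are made up for illustration; only json.dumps and json.loads from the standard library are used.

```python
import json

# Hypothetical stand-in for a scraped item; a real Scrapy item is accessed the same way.
item = {"title": "Example title", "subtitle": "Example subtitle"}

# A Python list of dictionaries, as built in process_item.
docs = [{"title": item["title"], "subtitle": item["subtitle"]}]

# json.dumps serializes the list into a single string holding a JSON array.
payload = json.dumps(docs)

print(type(docs).__name__)     # list
print(type(payload).__name__)  # str
print(payload)

# Round trip: parsing the string gives back an equal list of dictionaries.
restored = json.loads(payload)
```

The difference matters only if the client expects an already-serialized string; if it serializes documents itself, passing the string would encode the data twice, so it is worth checking which input the update() method expects.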
Upvotes: 1