Reputation: 125
I currently have Scrapy spiders that scrape XML feeds and store the information in a Postgres database using Django models.
This all works perfectly and gets the exact information I want. The problem is that the database needs to be updated once a day: new information added, changed information updated, and information that is no longer in the feed deleted.
So basically, when the spider runs I want it to check whether each item is already in the database: if it is exactly the same, ignore it; if the information has changed, update it; and if it no longer exists in the feed, delete it.
I just can't seem to figure out how to do this. Any ideas would be greatly appreciated.
Brian
Upvotes: 1
Views: 2286
Reputation: 1025
possible duplicate of How to update DjangoItem in Scrapy
NT3RP gives a great solution for updating all Django models in just one pipeline and a few functions.
You can populate a "false" primary key constructed from the data of the object. Then you can save the data, or update it in the model if it has already been scraped, in a single pipeline:
from scrapy.exceptions import DropItem

class ItemPersistencePipeline(object):
    def process_item(self, item, spider):
        try:
            item_model = item_to_model(item)
        except TypeError:
            return item
        model, created = get_or_create(item_model)
        try:
            update_model(model, item_model)
        except Exception as e:
            raise DropItem('Could not update model: %s' % e)
        return item
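For this to run at all, the pipeline has to be enabled in the Scrapy settings. A minimal sketch, where the module path `myproject.pipelines` is an assumption about your project layout:

```python
# settings.py -- "myproject" is a placeholder; use your actual project package
ITEM_PIPELINES = {
    'myproject.pipelines.ItemPersistencePipeline': 300,
}
```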
And of course the helper methods:
def item_to_model(item):
    # default of None so a plain Item doesn't raise AttributeError here
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")
    return item.instance
def get_or_create(model):
    model_class = type(model)
    created = False
    try:
        # We have no unique identifier at the moment;
        # use the model's "primary" field for now.
        obj = model_class.objects.get(primary=model.primary)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.
    return (obj, created)
from django.forms.models import model_to_dict

def update_model(destination, source, commit=True):
    pk = destination.pk
    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)
    setattr(destination, 'pk', pk)
    if commit:
        destination.save()
    return destination
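How the "false" primary key gets its value is up to you; one possible approach (the function name and fields here are illustrative, not from the answer) is to hash the fields that identify a record in the feed, so the same source record always maps to the same `primary` string:

```python
import hashlib

def make_primary(*identifying_fields):
    """Build a stable pseudo primary key from whatever fields identify
    a scraped record (e.g. feed URL + item GUID)."""
    raw = '|'.join(str(f) for f in identifying_fields)
    # sha1 hexdigest is 40 chars, so it fits CharField(max_length=80)
    return hashlib.sha1(raw.encode('utf-8')).hexdigest()

# The same inputs always produce the same key, so re-scraping an
# unchanged record finds the existing row via get_or_create().
key = make_primary('http://example.com/feed', 'item-123')
```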
You should also define the field "primary" on your Django models, so the pipeline can look up whether a newly scraped item already exists:
models.py
class Parent(models.Model):
    field1 = models.CharField(max_length=80)
    # could also be declared with primary_key=True
    primary = models.CharField(max_length=80)

class ParentX(models.Model):
    field2 = models.CharField(max_length=80)
    parent = models.OneToOneField(Parent, related_name='extra_properties',
                                  on_delete=models.CASCADE)
    primary = models.CharField(max_length=80)

class Child(models.Model):
    field3 = models.CharField(max_length=80)
    parent = models.ForeignKey(Parent, related_name='childs',
                               on_delete=models.CASCADE)
    primary = models.CharField(max_length=80)
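The above covers the create and update cases; for the delete part of the question (records no longer in the feed), one approach is to collect the `primary` keys seen during a crawl and remove everything else when the spider closes. A sketch only, not from the original answer: it assumes every item carries a `primary` value, that `Parent` is the Django model defined above, and that one full crawl sees every record that should survive:

```python
class StaleItemCleanupPipeline(object):
    """Delete rows whose "primary" key was not seen in this crawl."""

    def open_spider(self, spider):
        # fresh set of seen keys for each crawl
        self.seen = set()

    def process_item(self, item, spider):
        self.seen.add(item['primary'])
        return item

    def close_spider(self, spider):
        # Parent is the Django model from the answer above;
        # anything not re-scraped this run is assumed gone from the feed.
        Parent.objects.exclude(primary__in=self.seen).delete()
```

Note the assumption: if a crawl fails partway through, `close_spider` would wrongly delete rows it simply never reached, so you may want to guard this on the spider finishing cleanly.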
Upvotes: 1