Brian Williams

Reputation: 125

scrapy django update database with scraped data

I currently have scrapy spiders that scrape XML feeds and store the information in a postgres database using django models.

This all works and gets exactly the information I want. The problem is that the database needs to be updated once a day: new information has to be added, changed information updated, and information that is no longer in the feed deleted.

So when the spider runs, I want it to check whether each item is already in the database. If the information is exactly the same, ignore it; if it has changed, update it; and if it no longer exists in the feed, delete it.

I just can't seem to figure out how to do this. Any ideas would be greatly appreciated.

Brian

Upvotes: 1

Views: 2286

Answers (1)

Zartch

Reputation: 1025

Possible duplicate of "How to update DjangoItem in Scrapy".

NT3RP gives a great solution there for updating all Django models with a single pipeline and a few functions.

You can populate a "false" primary key built from the object's data. Then you can save the data, or update the existing row if the item was already scraped, all in one pipeline:

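from scrapy.exceptions import DropItem

class ItemPersistencePipeline(object):
    def process_item(self, item, spider):
        try:
            item_model = item_to_model(item)
        except TypeError:
            # Not a DjangoItem: pass it through untouched
            return item
        model, created = get_or_create(item_model)
        try:
            update_model(model, item_model)
        except Exception as e:
            # A pipeline must return an item or raise DropItem,
            # not return the exception object
            raise DropItem("Could not save item: %s" % e)
        return item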

And of course the helper functions it calls:

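def item_to_model(item):
    # Default to None so a missing attribute reaches the TypeError below
    # instead of raising AttributeError
    model_class = getattr(item, 'django_model', None)
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")
    return item.instance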

def get_or_create(model):
    model_class = type(model)
    created = False
    try:
        # There is no natural unique identifier at the moment,
        # so look the row up by the constructed `primary` value for now
        obj = model_class.objects.get(primary=model.primary)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.

    return (obj, created)
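
get_or_create only works if "primary" is already filled in on the item before it reaches the pipeline. How you build it depends on your feed; this is just a minimal sketch, assuming a few values in each entry identify it uniquely. make_primary and the example arguments are hypothetical names, not part of Scrapy or Django.

```python
import hashlib


def make_primary(*parts, max_length=80):
    """Build a deterministic lookup key from the values that identify a record."""
    # Strip surrounding whitespace so cosmetic differences in the feed
    # do not produce a different key for the same record.
    # "\x1f" (unit separator) is unlikely to appear inside the values.
    normalised = "\x1f".join(str(part).strip() for part in parts)
    digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
    # A SHA-256 hex digest is 64 characters, which fits a CharField(max_length=80).
    return digest[:max_length]


# Hypothetical usage inside a spider callback:
#     item['primary'] = make_primary(feed_url, entry_id)
```

The same input always gives the same key, so running the spider again tomorrow finds the row it saved today.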

from django.forms.models import model_to_dict

def update_model(destination, source, commit=True):
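    # Remember the stored row's pk so copying fields cannot lose it
    pk = destination.pk

    # Copy every editable field from the freshly scraped instance.
    # Note: model_to_dict returns foreign keys as raw pk values.
    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    # Restore the pk, which the loop may have reset to the unsaved source's None
    setattr(destination, 'pk', pk)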

    if commit:
        destination.save()

    return destination
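
Note that update_model saves every time, even when nothing changed. If you want the "exact same information, ignore it" case from the question, you can compare first. A rough sketch, assuming you have both sides as plain dicts (for example from model_to_dict); changed_fields is a hypothetical helper, not something Django provides.

```python
def changed_fields(current, incoming, ignore=("id",)):
    """Return {field: new_value} for every field whose value differs."""
    return {
        key: value
        for key, value in incoming.items()
        # Skip bookkeeping fields such as the auto-generated id
        if key not in ignore and current.get(key) != value
    }


stored = {"id": 7, "primary": "abc", "field1": "old title"}
scraped = {"id": None, "primary": "abc", "field1": "new title"}

diff = changed_fields(stored, scraped)
print(diff)  # {'field1': 'new title'}
```

An empty result means the row is unchanged and the save can be skipped.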

You should also define the "primary" field on your Django models, so the pipeline can look up whether a newly scraped item is already in the database.

models.py

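from django.db import models

class Parent(models.Model):
    # max_length is required on CharField; 80 is only a placeholder
    field1 = models.CharField(max_length=80)
    # Lookup key built from the scraped data (primary_key=True is left off)
    primary = models.CharField(max_length=80)

class ParentX(models.Model):
    field2 = models.CharField(max_length=80)
    # on_delete is required since Django 2.0; CASCADE was the old default
    parent = models.OneToOneField(Parent, related_name='extra_properties', on_delete=models.CASCADE)
    primary = models.CharField(max_length=80)

class Child(models.Model):
    field3 = models.CharField(max_length=80)
    parent = models.ForeignKey(Parent, related_name='childs', on_delete=models.CASCADE)
    primary = models.CharField(max_length=80)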
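
None of the above handles the third case in the question, rows that have disappeared from the feed. One possible approach, assuming the pipeline records every "primary" value it sees during a run: compare that set with what is stored once the spider has finished. stale_keys is a hypothetical helper, and the Django calls in the comments are only an outline of where it could be used.

```python
def stale_keys(stored_keys, seen_keys):
    """Return the keys that are stored but were not seen in this run."""
    return set(stored_keys) - set(seen_keys)


# Possible use in the pipeline's close_spider(self, spider) method:
#     stored = Parent.objects.values_list('primary', flat=True)
#     stale = stale_keys(stored, self.seen)   # self.seen filled in process_item
#     Parent.objects.filter(primary__in=stale).delete()

stored = ["a", "b", "c"]
seen = {"a", "c", "d"}
print(stale_keys(stored, seen))  # {'b'}
```

Only delete when the spider finished cleanly; after a failed or partial run almost every row would look stale.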

Upvotes: 1
