user1157751

Reputation: 2457

Django - Getting/Saving large objects takes a lot of time

I'm trying to get a few million items from a model and parse them. However, somehow it spends a lot of time getting and saving the data.

These are the current models that I have:

class mapt(models.Model):
    s = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=2000)

    def __unicode__(self):
        return str(self.s)

class datt(models.Model):
    s = models.IntegerField(primary_key=True)
    setid = models.IntegerField()
    var = models.IntegerField()
    val = models.IntegerField()

    def __unicode__(self):
        return str(self.s)

class sett(models.Model):
    setid = models.IntegerField(primary_key=True)
    block = models.IntegerField()
    username = models.IntegerField()
    ts = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)

class data_parsed(models.Model):
    setid = models.IntegerField(primary_key=True)
    block = models.CharField(max_length=2000)
    username = models.CharField(max_length=2000)
    data = models.CharField(max_length=200000)
    time = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)

The var and val fields of the datt model should actually act as foreign keys to mapt's s field. Furthermore, datt's setid field should act as a foreign key to sett's setid.

Lastly, data_parsed's setid is a foreign key to the sett model.

The algorithm is currently written this way:

def database_rebuild(start_data_parsed):
    listSetID = []
    #Part 1
    for items in sett.objects.filter(setid__gte=start_data_parsed):
        listSetID.append(items.setid)
    uniqueSetID = listSetID 

    #Part 2
    for items in uniqueSetID:
        try:
            SetID = items
            settObject = sett.objects.get(setid=SetID)

            UserName = mapt.objects.get(pk=settObject.username).name
            TS = settObject.ts
            BlockName = mapt.objects.get(pk=settObject.block).name

            DataPairs_1 = []
            DataPairs_2 = []
            DataPairs_1_Data = []
            DataPairs_2_Data = []

            for data in datt.objects.filter(setid__exact=SetID):
                DataPairs_1.append(data.var)
                DataPairs_2.append(data.val)

            for Data in DataPairs_1:
                DataPairs_1_Data.append(mapt.objects.get(pk=Data).name)

            for Data in DataPairs_2:
                DataPairs_2_Data.append(mapt.objects.get(pk=Data).name)

            assert (len(DataPairs_1) == len(DataPairs_2)), "Length not equal"

            #Part 3
            Serialize = []
            for idx, val in enumerate(DataPairs_1_Data):
                Serialize.append(str(DataPairs_1_Data[idx]) + ":PARSEABLE:" + str(DataPairs_2_Data[idx]) + ":PARSEABLENEXT:")


            Serialize_Text = ""
            for Data in Serialize:
                Serialize_Text += Data


            Data = Serialize_Text
            p = data_parsed(SetID,  BlockName, UserName, Data, TS)
            p.save()
        except AssertionError as e:
            print "Error:" + str(e.args)
            print "Possibly DataPairs does not have equal length"
        except Exception as e:
            print "Error:" + str(sys.exc_info()[0])
            print "Stack:" + str(e.args)

Basically, what it does is that:

  1. Finds all sett objects whose setid is greater than or equal to a given number

  2. Gets the UserName, TS, and BlockName, then gets every datt row with that setid; each row's var and val map back to the mapt 's' field. var and val are basically a NAME_OF_FIELD:VALUE kind of relationship.

  3. Serializes all the var and val parameters so that all the name/value pairs spread across the mapt table end up in a single row of data_parsed (see the example below).
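
For reference, a single serialized data value ends up looking something like this (field names and values made up for illustration):

# each var/val pair becomes "name:PARSEABLE:value", chained with ":PARSEABLENEXT:"
data = "color:PARSEABLE:red:PARSEABLENEXT:size:PARSEABLE:42:PARSEABLENEXT:"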

The current solution does everything I would like it to; however, on an Intel Core i5-4300U CPU @ 1.90GHz it parses around 15000 rows of data per day on a Celery periodic worker. I have around 3355566 rows of data in my sett table, so it will take roughly 223 days to parse them all.

Is there a way to speed up the process?

============================Updated============================

New Models:

class mapt(models.Model):
    s = models.IntegerField(primary_key=True)
    name = models.CharField(max_length=2000)

    def __unicode__(self):
        return str(self.s)

class sett(models.Model):
    setid = models.IntegerField(primary_key=True)
    block = models.ForeignKey(mapt, related_name='sett_block')
    username = models.ForeignKey(mapt, related_name='sett_username')
    ts = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)

# class sett(models.Model):
    # setid = models.IntegerField(primary_key=True)
    # block = models.IntegerField()
    # username = models.IntegerField()
    # ts = models.IntegerField()

    # def __unicode__(self):
        # return str(self.setid)

class datt(models.Model):
    s = models.IntegerField(primary_key=True)
    setid = models.ForeignKey(sett, related_name='datt_setid')
    var = models.ForeignKey(mapt, related_name='datt_var')
    val = models.ForeignKey(mapt, related_name='datt_val')

    def __unicode__(self):
        return str(self.s)

# class datt(models.Model):
    # s = models.IntegerField(primary_key=True)
    # setid = models.IntegerField()
    # var = models.IntegerField()
    # val = models.IntegerField()

    # def __unicode(self):
        # return str(self.s)

class data_parsed(models.Model):
    setid = models.ForeignKey(sett, related_name='data_parsed_setid', primary_key=True)
    block = models.CharField(max_length=2000)
    username = models.CharField(max_length=2000)
    data = models.CharField(max_length=2000000)
    time = models.IntegerField()

    def __unicode__(self):
        return str(self.setid)

New Parsing:

def database_rebuild(start_data_parsed, end_data_parsed):
    for items in sett.objects.filter(setid__gte=start_data_parsed, setid__lte=end_data_parsed):
        try:
            UserName = mapt.objects.get(pk=items.username_id).name
            TS = items.ts
            BlockName = mapt.objects.get(pk=items.block_id).name

            DataPairs_1 = []
            DataPairs_2 = []
            DataPairs_1_Data = []
            DataPairs_2_Data = []

            for data in datt.objects.filter(setid_id__exact=items.setid):
                DataPairs_1.append(data.var_id)
                DataPairs_2.append(data.val_id)

            for Data in DataPairs_1:
                DataPairs_1_Data.append(mapt.objects.get(pk=Data).name)

            for Data in DataPairs_2:
                DataPairs_2_Data.append(mapt.objects.get(pk=Data).name)

            assert (len(DataPairs_1) == len(DataPairs_2)), "Length not equal"

            Serialize = []
            for idx, val in enumerate(DataPairs_1_Data):
                Serialize.append(str(DataPairs_1_Data[idx]) + ":PARSEABLE:" + str(DataPairs_2_Data[idx]))

            Data = ":PARSEABLENEXT:".join(Serialize)
            p = data_parsed(items.setid, BlockName, UserName, Data, TS)
            p.save()
        except AssertionError as e:
            print "Error:" + str(e.args)
            print "Possibly DataPairs does not have equal length"
        except Exception as e:
            print "Error:" + str(sys.exc_info()[0])
            print "Stack:" + str(e.args)

Upvotes: 0

Views: 576

Answers (1)

Defining lists by appending repeatedly is very slow. Use list comprehensions or even just the list() constructor.
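
For instance, Part 1 of the original function could be collapsed to something like this (a sketch against the models in the question):

listSetID = [items.setid for items in sett.objects.filter(setid__gte=start_data_parsed)]

# or let the database return just the ids
listSetID = list(sett.objects.filter(setid__gte=start_data_parsed)
                 .values_list('setid', flat=True))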

In Python you should not join a list of strings using for loops and +=; you should use join().
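
Applied to the code above, the whole Serialize_Text loop becomes a single call:

# one pass instead of re-copying the growing string on every iteration
Serialize_Text = "".join(Serialize)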

But that is not the primary bottleneck here. You have a lot of objects.get() calls, each of which takes a database round trip. If you don't have millions of rows in the mapt table, you should probably just build a dictionary mapping mapt primary keys to mapt objects.
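
A sketch of that cache, assuming the whole mapt table fits comfortably in memory:

# one query up front instead of one mapt.objects.get() per name lookup
mapt_by_pk = {m.pk: m for m in mapt.objects.all()}

# then, instead of mapt.objects.get(pk=Data).name:
name = mapt_by_pk[Data].name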

Had you defined your foreign keys as foreign keys, the Django ORM could help you do much of this in about five queries in total. That is, instead of SomeModel.objects.get(id=some_instance.some_fk_id) you can do some_instance.some_fk (which will only hit the database the first time you do it for each instance). You can even get rid of that foreign-key query entirely if some_instance has been initialized as some_instance = SomeOtherModel.objects.select_related('some_fk').get(id=id_of_some_instance).
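
With the foreign keys from your updated models in place, the hot part of the loop could look roughly like this (a sketch, not tested against your data):

# select_related() pulls the related mapt rows in with a SQL join, so
# .block, .username, .var and .val do not hit the database again
query = sett.objects.filter(setid__gte=start_data_parsed,
                            setid__lte=end_data_parsed
                            ).select_related('block', 'username')
for items in query:
    BlockName = items.block.name      # no extra query
    UserName = items.username.name    # no extra query

    pairs = datt.objects.filter(setid=items).select_related('var', 'val')
    Data = ":PARSEABLENEXT:".join(
        "%s:PARSEABLE:%s" % (d.var.name, d.val.name) for d in pairs)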

Perhaps changing the models without changing the database will work.
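
For example, you can point the new ForeignKey fields at the existing integer columns with db_column, so no schema change is needed (column names assumed from the old models):

class sett(models.Model):
    setid = models.IntegerField(primary_key=True)
    # reuse the columns the old IntegerFields created, instead of the
    # "block_id" / "username_id" columns a ForeignKey would expect by default
    block = models.ForeignKey(mapt, related_name='sett_block', db_column='block')
    username = models.ForeignKey(mapt, related_name='sett_username', db_column='username')
    ts = models.IntegerField()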

Upvotes: 1
