I have the following code to take data from a very large CSV, import it into a Django model, and convert it into nested categories (an MPTT model).
import csv

# "Category" is the MPTT model shown further down; "path" points to the CSV file.
with open(path, "rt") as f:
    reader = csv.reader(f, dialect="excel")
    next(reader)  # skip the header row
    for row in reader:
        if int(row[2]) == 0:  # top-level category, no parent to look up
            obj = Category(
                cat_id=int(row[0]),
                cat_name=row[1],
                cat_parent_id=int(row[2]),
                cat_level=int(row[3]),
                cat_link=row[4],
            )
            obj.save()
        else:
            # one extra query per row to resolve the parent object
            parent_id = Category.objects.get(cat_id=int(row[2]))
            obj = Category(
                cat_id=int(row[0]),
                cat_name=row[1],
                cat_parent_id=int(row[2]),
                cat_level=int(row[3]),
                cat_link=row[4],
                parent=parent_id,
            )
            obj.save()
It takes over an hour to run this import. Is there a more efficient way to do this? I tried a bulk_create list comprehension, but found I need
parent_id = Category.objects.get(cat_id=int(row[2]))
to get mptt to nest correctly. Also, I think I have to save each insertion for mptt to update properly. I'm pretty sure the parent_id query is what makes the operation so slow, since it has to hit the database for every row in the CSV (over 27k rows).
The upside is this operation will only need to be done sporadically, but this dataset isn't so large that it should take over an hour to run.
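For reference, here is a minimal sketch of what such a bulk_create list comprehension might have looked like; this is an assumed reconstruction for illustration, not the code that was actually tried, with comments pointing at the two issues described above:

# Assumed reconstruction of the bulk_create attempt, for illustration only.
categories = [
    Category(
        cat_id=int(row[0]),
        cat_name=row[1],
        cat_parent_id=int(row[2]),
        cat_level=int(row[3]),
        cat_link=row[4],
        # The parent still has to be resolved per row, so this .get() fires
        # one query for every CSV row, and it can only find parents that are
        # already in the database.
        parent=(Category.objects.get(cat_id=int(row[2]))
                if int(row[2]) != 0 else None),
    )
    for row in reader
]
# bulk_create() inserts the objects without calling save(), which is where
# django-mptt maintains its tree fields, so, as noted above, the tree does
# not end up nested correctly.
Category.objects.bulk_create(categories)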
Edit
Category Model
from django.db import models
from mptt.models import MPTTModel, TreeForeignKey


class Category(MPTTModel):
    cat_id = models.IntegerField(null=False, name="cat_id")
    cat_name = models.CharField(max_length=100,
                                null=False,
                                name="cat_name",
                                default="")
    cat_parent_id = models.IntegerField(null=False, name="cat_parent_id")
    cat_level = models.IntegerField(null=False, name="cat_level")
    cat_link = models.URLField(null=True, name="cat_link")
    parent = TreeForeignKey("self",
                            on_delete=models.CASCADE,
                            null=True,
                            blank=True,
                            related_name="children")

    def __str__(self):
        return f"{self.cat_name}"

    class MPTTMeta:
        order_insertion_by = ["cat_name"]
Edit 2
cat_parent_id = int(row[2])
obj = Category(
    cat_id=int(row[0]),
    cat_name=row[1],
    cat_parent_id=int(row[2]),
    cat_level=int(row[3]),
    cat_link=row[4],
    # assign the foreign key by its raw "_id" attribute instead of fetching
    # the parent object first
    parent_id=cat_parent_id,
)