Reputation: 4332
I use Jupyter Notebook to play with the data that I store in django/postgres. I initialize my project this way:
import os
import sys
import django

sys.path.append('/srv/gr/prg')
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'prg.settings')
if 'setup' in dir(django):
    django.setup()
There are many individual processes that update the data and I wanted to multithread it to speed up the process. Everything works well when I do the updates in a single thread or use SQLite.
def extract_org_description(org_id):
    o = models.Organization.objects.get(pk=org_id)
    logging.info("Looking for description for %s" % o.symbol)
    try:
        content = open('/srv/data/%s.html' % o.symbol)
    except FileNotFoundError:
        logging.error("HTML file not found for %s" % o.symbol)
        return
    with content:
        doc = BeautifulSoup(content, 'html.parser')
    desc = doc.select("#cr_description_mod > div.cr_expandBox > div.cr_description_full.cr_expand")
    if not desc or not desc[0]:
        logging.info("Cannot find description for %s" % o.symbol)
        return
    o.description = desc[0].text
    o.save(update_fields=['description'])
    logging.info("Description for %s found" % o.symbol)
    return "done %s" % org_id
And this will not work:
p = Pool(2)
result = p.map(extract_org_description, orgs)
print(result)
Most of the time it hangs until I interrupt it, without any particular error; sometimes Postgres reports "There is already a transaction in progress", and sometimes I see a "No results to fetch" error. By playing with the pool size I could make it work maybe once or twice, but it's hard to diagnose what exactly the issue is.
I tried changing the strategy to selecting the objects first and mapping them to an extract_org_description that takes the object as the parameter (instead of selecting by primary key inside the worker), but this does not work any better.
The only thought I have is that when Django tries to autocommit, all of the individual updates, including the ones happening in other threads, end up in the same transaction scope, and this is causing the issue. But I don't understand how to fix this in Django.
Upvotes: 3
Views: 2736
Reputation: 49032
Your question includes the terms multiprocessing and multithread, but it's important to understand that these are different ways of achieving concurrency.
Django has built-in support for multithreading, and will create a new database connection for each thread. If you switch from multiprocessing to multithreading your problem should be solved.
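For example, the same mapping can be run with a thread pool instead of a process pool. A minimal sketch, using a stand-in worker in place of the real ORM function from the question:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_org_description(org_id):
    # Stand-in for the real function from the question; in the notebook
    # this body would fetch and update the Organization via the ORM,
    # and Django would open a separate database connection per thread.
    return "done %s" % org_id

orgs = [1, 2, 3]
with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map preserves input order, like Pool.map
    results = list(executor.map(extract_org_description, orgs))
```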
In multiprocessing, the entire process is forked and the new process will have the same database connection as the old one. That results in the problems you're seeing, where, for example, you try to open a new transaction when one has already been opened on the same database connection by another process.
If you truly need multiprocessing instead of multithreading, there are probably solutions. For example, this answer suggests simply closing the database connections, forcing Django to create fresh ones.
Upvotes: 7