user3595632
user3595632

Reputation: 5730

Python: What should I use multi-process or multi-thread in DB related tasks?

This thread explains what the CPU bound, IO bound problems.

Given that the Python has GIL structure, someone recommended,

• Use threads for I/O bound problems
• Use processes, networking, or events (discussed in the next section) for CPU-bound problems

Honestly, I cannot fully and intuitively understand what these problems really are.

Here is the situation I'm faced with:

crawl_item_list = [item1, item2, item3 ....]
for item in crawl_item_list:
    crawl_and_save_them_in_db(item)

def crawl_and_save_them_in_db(item):
    # Pseudo code    
    # crawled_items = crawl item from the web  // the number of crawled_items  is usually 200
    # while crawled_items:
    #    save them in PostgreSQL DB
    #    crawled_items = crawl item from the web    

This is the task that I want to perform with parallel processes or thread. (Each process(or thread) will have their own crawl_and_save_them_in_db and deals with each item)

In this case, which one should I choose between multi-processes(something like Pool) and multi-thread?

I think that since the main job of this task is storing the DB, which is kind of IO bound task(Hope it is..), so I have to use multi thread? Am I right?

Need your advices.

Upvotes: 2

Views: 319

Answers (1)

KenanBek
KenanBek

Reputation: 1067

It depends on amount of data which will be stored.

If its millions of records then I strongly recommend to use multiprocessing approach. It maybe python's builtin multiprocessing or 3rd party package.

If data is more lightweight then go with threading or even you can try gevent.

In my crawling project I first started with threads, then moved to gevent because it was more easy to support. After my data become millions of records, the part which was responsible for storing mass amount of data moved to separate multiprocessing module with inner threads. It is a little bit uncomfortable to support and improve it but the process which worked hours now takes 5-10 minutes.

Upvotes: 2

Related Questions