Reputation: 482
I have a bunch of dataframes stored in a dictionary ('df_dict'). Each dataframe has 100 rows. For each row I need to fill the columns 'new_score', 'new_num_comments', and 'upvote_ratio' with the current data from Reddit. I am using PRAW to access the Reddit API.
Since it takes so long to update each row sequentially, I am trying to use multithreading to get the data. So I take a dataframe with 100 rows, start a thread for each row, and spawn a PRAW instance in each thread.
Somehow my code seems to work and update the rows, but it takes way too long - awfully long. There is no difference from updating sequentially: it takes almost 11.4 seconds to update one row with my "multithreading" attempt, while it takes 0.2 seconds if I do it sequentially. What am I doing wrong?
Here is my code. I tried to cut out as much as I could and obviously redacted my credentials:
from threading import Thread, Lock

import praw

# Dataframes are stored in df_dict
mutex = Lock()
threads = []

class ReqThread(Thread):
    def __init__(self, threadID, index, row):
        Thread.__init__(self)
        self.threadID = threadID
        self.index = index
        self.row = row

    def run(self):
        print("Starting %s" % self.threadID)
        for row in self.row:
            worker(index=self.index, row=self.row)
        print("Exiting %s" % self.threadID)

def make_reddit():
    return praw.Reddit(client_id=client_id, client_secret=client_secret,
                       username=username, password=password, user_agent=user_agent)

def worker(index, row):
    global df
    print('Request-ID: %s' % row['id'])
    reddit = make_reddit()
    submission = reddit.submission(row['id'])
    mutex.acquire()
    df.at[index, 'new_score'] = submission.score
    df.at[index, 'upvote_ratio'] = submission.upvote_ratio
    df.at[index, 'new_num_comments'] = submission.num_comments
    mutex.release()

for i in df_dict:
    df = df_dict[i]
    for index, row in df.iterrows():
        t = ReqThread(threadID=index, index=index, row=row)
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()
    df.to_csv('u_{i}.csv'.format(i=i))
EDIT: Recalculated how long my "multithreading" attempt actually takes.
Upvotes: 2
Views: 147
Reputation: 26
What you seem to be running into is the classic Python threading catch.
Why are these 'threads' slower?
Despite what it may seem, Python threads are not truly parallel. The _thread and threading modules do use real OS threads, which are great for I/O-bound concurrency but much less so for CPU-bound tasks. This comes down to Python's GIL (Global Interpreter Lock), which keeps everything thread-safe by letting only one thread execute Python bytecode at a time.
Because you don't get the advantage of multiple cores, all you gain is the overhead of juggling multiple OS threads.
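To see the GIL in action, here is a minimal sketch (not from the question; the countdown function and iteration count are arbitrary) that times a CPU-bound task run sequentially and then split across two threads. On CPython both take roughly the same wall time:

import time
from threading import Thread

def count_down(n):
    # Pure CPU-bound busy work.
    while n > 0:
        n -= 1

N = 10_000_000

start = time.perf_counter()
count_down(N)
count_down(N)
print('sequential: %.2fs' % (time.perf_counter() - start))

start = time.perf_counter()
t1 = Thread(target=count_down, args=(N,))
t2 = Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
# Roughly the same as the sequential run: the GIL lets only
# one thread execute Python bytecode at a time.
print('threaded:   %.2fs' % (time.perf_counter() - start))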
So how do I do multi-core processing in Python?
To get around the GIL, Python makes use of separate processes to spread the load across multiple cores. Modules like multiprocessing spawn additional Python interpreter processes internally. You can try this for yourself: make a process pool of 4 workers, start them, and notice how 4 extra Python processes are spawned as well.
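As a minimal sketch (the function name fetch_row and the IDs are made up for illustration, not taken from the question):

from multiprocessing import Pool

def fetch_row(submission_id):
    # Placeholder for the expensive per-row work (e.g. an API request).
    return submission_id

if __name__ == '__main__':
    ids = ['abc123', 'def456', 'ghi789']  # illustrative submission IDs
    # A pool of 4 worker processes; watch your task manager and you
    # will see 4 extra Python processes appear while this runs.
    with Pool(processes=4) as pool:
        results = pool.map(fetch_row, ids)
    print(results)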
Important things to note
Upvotes: 1