Reputation: 482
I have a bunch of dataframes stored in a dictionary ('df_dict'). Each dataframe has 100 rows. For each row I need to fill the columns 'new_score', 'new_num_comments', and 'upvote_ratio' with the current data from Reddit. I am using PRAW to access the Reddit API.
Since it takes so long to update each row sequentially, I am trying to use multithreading to get the data. So I take a dataframe with 100 rows, start a thread for each row, and spawn a PRAW instance in each thread.
Somehow my code seems to work and update the rows, but it takes way too long - awfully long. There is no difference from updating sequentially: it takes almost 11.4 seconds to update one row with my "multithreading" attempt, while it takes 0.2 seconds if I do it sequentially. What am I doing wrong?
Here is my code. I tried to cut out as much as I could and obviously redacted my credentials:
from threading import Thread, Lock

import praw

# Dataframes are stored in df_dict
mutex = Lock()
threads = []

class ReqThread(Thread):
    def __init__(self, threadID, index, row):
        Thread.__init__(self)
        self.threadID = threadID
        self.index = index
        self.row = row

    def run(self):
        print("Starting %s" % self.threadID)
        for row in self.row:
            worker(index=self.index, row=self.row)
        print("Exiting %s" % self.threadID)

def make_reddit():
    return praw.Reddit(client_id=client_id, client_secret=client_secret,
                       username=username, password=password, user_agent=user_agent)

def worker(index, row):
    global df
    print('Request-ID: %s' % row['id'])
    reddit = make_reddit()
    submission = reddit.submission(row['id'])
    mutex.acquire()
    df.at[index, 'new_score'] = submission.score
    df.at[index, 'upvote_ratio'] = submission.upvote_ratio
    df.at[index, 'new_num_comments'] = submission.num_comments
    mutex.release()

for i in df_dict:
    df = df_dict[i]
    for index, row in df.iterrows():
        t = ReqThread(threadID=index, index=index, row=row)
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()
    df.to_csv('u_{i}.csv'.format(i=i))
EDIT: Recalculated how long my "multithreading" attempt actually takes.
Upvotes: 2
Views: 147
Reputation: 26
What you seem to be running into is the classic Python threading catch.
Why are these 'threads' slower?
Despite what it may seem, Python threads are not truly parallel. The _thread and threading modules do use real OS threads, which are great for I/O-bound concurrency but much less so for CPU-bound tasks. This comes down to Python's GIL (Global Interpreter Lock), which keeps everything thread-safe by letting only one thread execute Python bytecode at a time.
Because you don't get the advantage of multiple cores, all you gain is the overhead of juggling multiple OS threads.
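To see the GIL in action, here is a minimal sketch (not from the question; the countdown function and iteration count are arbitrary) that times a CPU-bound task run sequentially and then split across two threads. On CPython both take roughly the same wall time:

import time
from threading import Thread

def count_down(n):
    # Pure CPU-bound busy work.
    while n > 0:
        n -= 1

N = 10_000_000

start = time.perf_counter()
count_down(N)
count_down(N)
print('sequential: %.2fs' % (time.perf_counter() - start))

start = time.perf_counter()
t1 = Thread(target=count_down, args=(N,))
t2 = Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
# Roughly the same as the sequential run: the GIL lets only
# one thread execute Python bytecode at a time.
print('threaded:   %.2fs' % (time.perf_counter() - start))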
So how do I do multi-core processing in Python?
To get around the GIL, Python makes use of separate processes to spread the load across multiple cores. Modules like multiprocessing spawn additional Python interpreter processes internally. You can try this for yourself: make a process pool of 4 workers, start them, and notice how 4 extra Python processes are spawned as well.
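As a minimal sketch (the function name fetch_row and the IDs are made up for illustration, not taken from the question):

from multiprocessing import Pool

def fetch_row(submission_id):
    # Placeholder for the expensive per-row work (e.g. an API request).
    return submission_id

if __name__ == '__main__':
    ids = ['abc123', 'def456', 'ghi789']  # illustrative submission IDs
    # A pool of 4 worker processes; watch your task manager and you
    # will see 4 extra Python processes appear while this runs.
    with Pool(processes=4) as pool:
        results = pool.map(fetch_row, ids)
    print(results)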
Important things to note
Upvotes: 1