Reputation: 17152
So I tried to simplify the problem as much as possible. I know this looks like a wall of text, but I'll try to give you a bit of context. I created a simple SQLite database with over 6 million rows. The test table looks like this:
class Backlog(Base):
    __tablename__ = "files"
    id = Column(Integer, primary_key=True)
    filename = Column(String)
    date = Column(Date)
I am trying to see if it's possible to multi-thread calls at the same time so I get faster I/O. After reading the docs and checking SO, I decided to use scoped_session. The fetching code looks like this:
from sqlalchemy import *
from sqlalchemy.orm import sessionmaker, scoped_session, Session
import queue
import threading
import time

def get_sql_proc(ThreadSession):
    some_session = ThreadSession()
    sql = text('select * from files')
    call = some_session.execute(sql)
    res = call.fetchall()
    ThreadSession.remove()
    return res

def add_func_to_thread(q, func, kwargs):
    q.put(func(**kwargs))

engine = create_engine('sqlite:///christmas.db', echo=False)
session_factory = sessionmaker(bind=engine)
ThreadSession = scoped_session(session_factory)

q = queue.Queue()
st = time.time()
threads = list()
for i in range(1, 4):
    t = threading.Thread(target=add_func_to_thread,
                         args=(q, get_sql_proc, {'ThreadSession': ThreadSession}))
    t.daemon = True
    t.start()
    threads.append(t)
for x in threads:
    x.join()
print(time.time() - st)
This gets a runtime of about 88 seconds, but when I simply call them one after the other, I get a runtime of about 27 seconds:
engine = create_engine('sqlite:///christmas.db', echo=False)
session = Session(engine)
st = time.time()
for i in range(1, 4):
    w = session.execute('select * from files').fetchall()
print(time.time() - st)
I did the same test on an MS SQL Server and got similar results. I am very confused about what I am doing wrong and why it is so much slower with the multi-threaded approach; any tips or tricks would be appreciated.
Upvotes: 1
Views: 187
Reputation: 348
Python has a Global Interpreter Lock (GIL), so only one thread is actually executing Python code at any given time. The Python threading module provides concurrency, but not parallelism.
As a very simplistic analogy, you could think of having two decks of cards in front of you, and you want to look at every card in both decks. The threading module presents itself as though it picks a card from each deck and reads both simultaneously, but behind the scenes it just shuffles the decks together and looks at the cards one at a time.
Without knowing many intimate details, I would guess that the performance is not as expected because the "threaded" calls are actually happening serially; the extra overhead of the threading, in this case, only acts as a performance detriment.
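As a minimal sketch of the point above (a toy CPU-bound function, not the asker's query): two threads both finish with the correct answer, but because of the GIL their bytecode runs interleaved, so there is no wall-clock speedup over running the work back to back.

```python
import threading

def sum_of_squares(n, out, idx):
    # Pure-Python loop: holds the GIL while it runs.
    total = 0
    for i in range(n):
        total += i * i
    out[idx] = total

# Run the same work twice "in parallel" with threads.
results = [0, 0]
threads = [threading.Thread(target=sum_of_squares, args=(100_000, results, i))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both results are correct, but the two loops did not actually
# execute simultaneously; the GIL serialized them.
print(results[0] == results[1])
```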
Python provides a different mechanism for parallelism in the multiprocessing module. It spawns (or is capable of spawning, depending on use) a new interpreter process which runs the desired function in parallel. Sharing memory between processes is somewhat more complicated than with threads, though, so it's NOT likely to be as simple as changing threading.Thread to multiprocessing.Process.
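For illustration, here is a rough sketch of what a multiprocessing version of the asker's fetch might look like. The query and database URL are taken from the question; `fetch_count` and `run_parallel` are hypothetical names. Note that each worker must create its own engine, since DBAPI connections cannot be shared across process boundaries, and that workers should send back something small and picklable rather than millions of raw rows (shipping the rows back through a queue would eat much of the gain).

```python
import multiprocessing as mp

from sqlalchemy import create_engine, text

def fetch_count(db_url, q):
    # Each process builds its own engine: DBAPI connections must not
    # be shared across process boundaries.
    engine = create_engine(db_url, echo=False)
    with engine.connect() as conn:
        rows = conn.execute(text('select * from files')).fetchall()
    q.put(len(rows))  # send back something small and picklable

def run_parallel(db_url, n_workers=3):
    q = mp.Queue()
    procs = [mp.Process(target=fetch_count, args=(db_url, q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    counts = [q.get() for _ in procs]  # drain before join to avoid blocking
    for p in procs:
        p.join()
    return counts

# e.g. run_parallel('sqlite:///christmas.db') returns one row count per worker
```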
Upvotes: 1