Reputation: 2328
I loop through a number of messages with an unixtime as well as an userid, where I want to find the number of messages inside a 24 hr time slot for each user. I posted my code on codereview to get some help. From there I optimized the
cur.execute('SELECT unixtime FROM MessageType1 WHERE userID ='+str(userID[index])+' ORDER BY unixtime asc')
query as I found that it compromized of 6.7s of the total 7.2s running time of my code. I optimized it by making indexes on the unixtime column using CREATE INDEX unixtime_times ON MessageType1 (unixtime asc)
. Now that query takes 0.00117s opposed to 6.7s, but the overall time it takes to run my code has gone from 7.2s to 15.8s. Everything is unchanged except for the indexes. Upon further inspection, it seems like messages = cur.fetchall()
is taking 15.3s after implementing indexes. Anyone have clue why? Thanks in advance!
con = lite.connect(databasepath)
userID = []
messages = []
messageFrequency = []
with con:
cur = con.cursor()
#Get all UserID
cur.execute('SELECT DISTINCT userid FROM MessageType1')
userID = cur.fetchall()
userID = {index:x[0] for index,x in enumerate(userID)}
#For each UserID
for index in range(len(userID)):
messageFrequency.append(0)
#Get all MSG with UserID = UserID sorted by UNIXTIME
cur.execute('SELECT unixtime FROM MessageType1 WHERE userID ='+str(userID[index])+' ORDER BY unixtime asc')
messages = cur.fetchall()
messages = {index:x[0] for index,x in enumerate(messages)}
#Loop through every MSG
for messageIndex in range(len(messages)):
frequency = 0
message = messages[messageIndex]
for nextMessageIndex in range(messageIndex+1, len(messages)):
#Loop through every message that is within 24 hours
nextmessage = messages[nextMessageIndex]
if nextmessage < message+(24*60*60):
#Count the number of occurences
frequency += 1
else:
break
#Add best benchmark for every message to a list that should be plotted.
if messageFrequency[-1]<frequency:
messageFrequency[-1] = frequency
Upvotes: 0
Views: 66
Reputation: 1271003
The best index for this query:
SELECT unixtime
FROM MessageType1
WHERE userID ='+str(userID[index])+'
ORDER BY unixtime asc
is MessageType1(UserId, unixtime)
.
With an index on unixtime
only, the database has basically two possible execution plans:
My guess is that it is choosing the second approach, based on your timings. The "fetch" component of the processing ends the execution of the query, so that is really fast. It then has to read the entire table to get the results you want.
This approach can take longer than just reading the data in order, because of locality issues. Without the index, it would read the first page and all the records on the first page. With the index, each record is on a random page -- no locality. This can be particularly problematic when the table is larger than memory available for the page cache, and you end up with a situation known as "thrashing".
Upvotes: 2