Why does query time decrease, but overall code time increase when implementing indexes?

Question

I loop through a number of messages with an unixtime as well as an userid, where I want to find the number of messages inside a 24 hr time slot for each user. I posted my code on codereview to get some help. From there I optimized the

cur.execute('SELECT unixtime FROM MessageType1 WHERE userID ='+str(userID[index])+' ORDER BY unixtime asc')

query as I found that it compromized of 6.7s of the total 7.2s running time of my code. I optimized it by making indexes on the unixtime column using CREATE INDEX unixtime_times ON MessageType1 (unixtime asc). Now that query takes 0.00117s opposed to 6.7s, but the overall time it takes to run my code has gone from 7.2s to 15.8s. Everything is unchanged except for the indexes. Upon further inspection, it seems like messages = cur.fetchall() is taking 15.3s after implementing indexes. Anyone have clue why? Thanks in advance!

con = lite.connect(databasepath)
    userID = []
    messages = []
    messageFrequency = []
    with con:
        cur = con.cursor()
        #Get all UserID
        cur.execute('SELECT DISTINCT userid FROM MessageType1')
        userID = cur.fetchall()
        userID = {index:x[0] for index,x in enumerate(userID)}
        #For each UserID
        for index in range(len(userID)):
            messageFrequency.append(0)
            #Get all MSG with UserID = UserID sorted by UNIXTIME
            cur.execute('SELECT unixtime FROM MessageType1 WHERE userID ='+str(userID[index])+' ORDER BY unixtime asc')
            messages = cur.fetchall()
            messages = {index:x[0] for index,x in enumerate(messages)}
            #Loop through every MSG
            for messageIndex in range(len(messages)):
                frequency = 0
                message = messages[messageIndex]
                for nextMessageIndex in range(messageIndex+1, len(messages)):
                #Loop through every message that is within 24 hours
                    nextmessage = messages[nextMessageIndex]
                    if  nextmessage < message+(24*60*60):
                    #Count the number of occurences
                        frequency += 1
                    else:
                        break
                #Add best benchmark for every message to a list that should be plotted.
                if messageFrequency[-1]

Gordon Linoff · Accepted Answer

The best index for this query:

SELECT unixtime
FROM MessageType1
WHERE userID ='+str(userID[index])+'
ORDER BY unixtime asc

is MessageType1(UserId, unixtime).

With an index on unixtime only, the database has basically two possible execution plans:

It can ignore the index, read the rows "sequentially", filter them, and then do the sort.
It can return the rows from the index, in sorted order, and then filter on output.

My guess is that it is choosing the second approach, based on your timings. The "fetch" component of the processing ends the execution of the query, so that is really fast. It then has to read the entire table to get the results you want.

This approach can take longer than just reading the data in order, because of locality issues. Without the index, it would read the first page and all the records on the first page. With the index, each record is on a random page -- no locality. This can be particularly problematic when the table is larger than memory available for the page cache, and you end up with a situation known as "thrashing".

Why does query time decrease, but overall code time increase when implementing indexes?

Answers (1)

Related Questions