Reputation: 1013
I have been fighting with this for a few weeks now -- as you can see here and here. Although the answers I got there are correct, they don't work for me because of performance issues. More below.
I am using SQLite. I have a table where each row represents an individual at a point in time. I am trying to create a column (statusLag) with the lagged values of another column (status), like so:
workerID  timeVar  status  statusLag
------------------------------------
1         1        0       NULL
1         2        1       0
1         3        1       1
1         4        1       1
------------------------------------
2         3        1       NULL
2         4        0       1
2         5        1       0
2         6        0       1
Here is the query I came up with:
UPDATE myTable
SET statusLag = (SELECT t2.status
FROM myTable t2
WHERE t2.workerID = myTable.workerID AND
t2.timeVar < myTable.timeVar
ORDER BY t2.timeVar desc
LIMIT 1
);
This does the job in the tests I ran. However, I have a very large table -- 3.2 billion rows. Even on a table of 300 million rows, which is what I could work with for now, this query uses far too much RAM, more than my machine (12GB) can handle. So, for all (my) practical purposes, this doesn't work.
It seems to me that I could do this with a JOIN, something along the lines of:
SELECT t1.*, t2.status as statusLag FROM myTable AS t1
LEFT JOIN myTable AS t2
ON t1.workerID=t2.workerID AND t1.timeVar=t2.timeVar+1
ORDER BY t1.workerID, t1.timeVar ;
I am not clear on whether this will do the job or not. Also, this is the first time I am using a JOIN, and it is my impression that this query alone won't insert or update anything in myTable. Do I need to complement this query to accomplish what I explained at the beginning of my post?
Any thoughts, any help, much appreciated. I have been struggling with this for two weeks now, and I need to get this done.
Upvotes: 1
Views: 400
Reputation: 691
UPDATE is a slow operation, because the database engine has to fetch each row, run the subquery to get the new value, and then store the row back in place. Sometimes the row no longer fits where it was and must be written somewhere else in storage.
For better performance, I would do it like this:
1. Create a new table myLagTable with the column statusLag (see the sketch below).
2. Fill it with an INSERT (see below).
3. Rename myTable to myOldTable.
4. Rename myLagTable to myTable.
Doing this, myTable is only unavailable during the rename operations.
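A minimal sketch of the create and rename steps, assuming every column is an INTEGER (adjust the types to whatever myTable really uses):
-- Column types here are assumptions; match them to your real schema.
CREATE TABLE myLagTable (
    workerID  INTEGER,
    timeVar   INTEGER,
    status    INTEGER,
    statusLag INTEGER
);

-- ...run the INSERT shown below to fill myLagTable...

-- Swap the tables; myTable is only unavailable while these two run.
ALTER TABLE myTable RENAME TO myOldTable;
ALTER TABLE myLagTable RENAME TO myTable;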
To populate myLagTable, you can then use the JOIN:
INSERT INTO myLagTable(workerID, timeVar, status, statusLag)
SELECT t1.workerID, t1.timeVar, t1.status, t2.status AS statusLag
FROM myTable AS t1
LEFT JOIN myTable AS t2
ON t1.workerID=t2.workerID AND t1.timeVar=t2.timeVar+1;
Notice that I removed the ORDER BY, as it's not necessary in an INSERT, and it would use resources (CPU, memory, disk) to sort the rows.
Once you are satisfied with the result, you can drop myOldTable.
Upvotes: 0
Reputation: 8969
One strategy you could try is to break it up into a number of smaller updates.
That is, you don't try to update 3.2 billion rows at a time. Find a way to break it into maybe 30 groups of 100 million rows and do them one group at a time.
There are a couple of disadvantages to this approach, but the big advantage is that it'll probably work (eventually).
Something like:
UPDATE myTable
SET statusLag = (SELECT t2.status
FROM myTable t2
WHERE t2.workerID = myTable.workerID AND
t2.timeVar < myTable.timeVar
ORDER BY t2.timeVar desc
LIMIT 1
)
WHERE workerID = 1;
And run that for each workerID in your database.
Or:
WHERE workerID >= 0 AND workerID < 1000;
etc.
The important thing is to let the database do the update and complete the transaction before you move to the next set of records to update. If you try to wrap it all in one transaction then you've got the same problem of having to manage updates to 3.2 billion rows in a dynamic update.
You might want to automate the update (e.g. write some Java or something to loop through the workerID ranges), or you could simply create the 30 or 40 SQL statements using a decent text editor and then run them by hand.
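For example, the hand-built statements could look like this sketch, with each batch wrapped in its own transaction so it is committed before the next one starts (the batch ranges are just placeholders):
BEGIN;
UPDATE myTable
SET statusLag = (SELECT t2.status
                 FROM myTable t2
                 WHERE t2.workerID = myTable.workerID AND
                       t2.timeVar < myTable.timeVar
                 ORDER BY t2.timeVar DESC
                 LIMIT 1)
WHERE workerID >= 0 AND workerID < 1000;
COMMIT;

BEGIN;
UPDATE myTable
SET statusLag = (SELECT t2.status
                 FROM myTable t2
                 WHERE t2.workerID = myTable.workerID AND
                       t2.timeVar < myTable.timeVar
                 ORDER BY t2.timeVar DESC
                 LIMIT 1)
WHERE workerID >= 1000 AND workerID < 2000;
COMMIT;
-- ...and so on for the remaining ranges.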
If you need to maintain this data going forward, then I'd suggest you include the lag value when you create each row - one row at a time is much easier than a large batch like this.
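A minimal sketch of that per-row idea (the literal values 1, 5 and 0 stand in for a hypothetical new observation):
-- 1, 5 and 0 are placeholders for the new row's workerID, timeVar and status;
-- the subquery looks up the most recent earlier status for that worker.
INSERT INTO myTable (workerID, timeVar, status, statusLag)
SELECT 1, 5, 0,
       (SELECT t2.status
        FROM myTable t2
        WHERE t2.workerID = 1 AND t2.timeVar < 5
        ORDER BY t2.timeVar DESC
        LIMIT 1);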
Upvotes: 1
Reputation: 1270061
If you want to change values in the table, then you need to use update (or insert or delete).
If this is your query:
UPDATE myTable
SET statusLag = (SELECT t2.status
FROM myTable t2
WHERE t2.workerID = myTable.workerID AND
t2.timeVar < myTable.timeVar
ORDER BY t2.timeVar desc
LIMIT 1
);
Then you might have a chance with an index on myTable(workerID, timeVar, status). But, to be honest, with 3.2 billion rows and 12 GB of RAM, this still might not help.
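In SQLite that index would be created with something like this (the index name is just a placeholder):
CREATE INDEX idx_worker_time_status ON myTable(workerID, timeVar, status);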
Almost any other database has constructs that could help (lag(), or variables in MySQL). Is there any possibility of using another database?
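For illustration, this is roughly what the lag() version looks like in a database with window functions (recent versions of SQLite, 3.25 and later, support them as well):
SELECT workerID, timeVar, status,
       LAG(status) OVER (PARTITION BY workerID ORDER BY timeVar) AS statusLag
FROM myTable;
Note that this only reads the data; you would still need to write the result back into a table.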
Upvotes: 0