Reputation: 3981
I have a database table that receives close to 1 million inserts a day and needs to stay searchable for at least a year. The hardware is modest: a big hard drive holding a lot of data, and not much else.
The table looks like this:
id     | tag_id | value | time
-------+--------+-------+--------------------
279571 | 55     | 0.57  | 2013-06-18 12:43:22
...
tag_id refers to a tag such as AmbientTemperature or AmbientHumidity, and time is captured when the reading is taken from the sensor.
I'm querying on this table in a reporting format. I want to see all data for tags 1,55,72, and 4 between 2013-11-1 and 2013-11-28 at 1 hour intervals.
SELECT time, tag_id, tag_name, value, friendly_name
FROM (
    SELECT time, tag_name, tag_id, value, friendly_name,
           ROW_NUMBER() OVER (PARTITION BY tag_id, datediff(hour, 0, time)
                              ORDER BY time DESC) AS seqnum
    FROM tag_values tv
    JOIN tag_names tn ON tn.id = tv.tag_id
    WHERE tag_id IN (1, 55, 72, 4)
      AND time >= '2013-11-01' AND time < '2013-11-28'
) k
WHERE seqnum = 1
ORDER BY time
Can I optimize this table or my query at all? How should I set up my indexes?
It's pretty slow with a table of 100+ million rows: fetching 7 days of data at an hourly interval for 3 tags can take several minutes.
Upvotes: 0
Views: 129
Reputation: 205
Filtering on the result of the ROW_NUMBER function makes the query painfully slow; it also prevents optimal index use.
If your primary reporting need is hourly information you might want to consider storing which rows are the first sensor reading for a tag in a specific hour.
ALTER TABLE tag_values ADD IsHourlySensorReading BIT NULL;
In an hourly process, you calculate this column for new rows.
DECLARE @CalculateFrom DATETIME = (SELECT MIN(time) FROM tag_values WHERE IsHourlySensorReading IS NULL);
-- Floor the starting point to the top of the hour:
SET @CalculateFrom = dateadd(hour, datediff(hour, 0, @CalculateFrom), 0);
UPDATE k
SET IsHourlySensorReading = CASE seqnum WHEN 1 THEN 1 ELSE 0 END
FROM (
    SELECT id, IsHourlySensorReading,
           ROW_NUMBER() OVER (PARTITION BY tag_id, datediff(hour, 0, time)
                              ORDER BY time DESC) AS seqnum
    FROM tag_values tv
    WHERE tv.time >= @CalculateFrom
      AND tv.IsHourlySensorReading IS NULL
) AS k
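The dateadd/datediff pair used above is the standard SQL Server idiom for flooring a DATETIME to the start of its hour; a quick illustration:

```sql
-- datediff(hour, 0, t) counts whole hours since 1900-01-01 00:00:00;
-- dateadd(hour, n, 0) adds n hours back onto 1900-01-01,
-- which together truncate the minutes and seconds.
SELECT dateadd(hour, datediff(hour, 0, '2013-06-18 12:43:22'), 0);
-- returns 2013-06-18 12:00:00.000
```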
Your reporting query then becomes much simpler:
SELECT time, tag_id, tag_name, value, friendly_name
FROM tag_values tv
JOIN tag_names tn ON tn.id = tv.tag_id
WHERE tag_id IN (1, 55, 72, 4)
  AND time >= '2013-11-01' AND time < '2013-11-28'
  AND IsHourlySensorReading = 1
ORDER BY time;
The following index will help when calculating the IsHourlySensorReading column. But remember that indexes also make your million inserts per day take longer. Test thoroughly!
CREATE NONCLUSTERED INDEX tag_values_ixnc01 ON tag_values (time, IsHourlySensorReading) WHERE (IsHourlySensorReading IS NULL);
Use this index for reporting if you need order by time.
CREATE NONCLUSTERED INDEX tag_values_ixnc02 ON tag_values (time, tag_id, IsHourlySensorReading) INCLUDE (value) WHERE (IsHourlySensorReading = 1);
Use this index for reporting if you don't need order by time.
CREATE NONCLUSTERED INDEX tag_values_ixnc03 ON tag_values (tag_id, time, IsHourlySensorReading) INCLUDE (value) WHERE (IsHourlySensorReading = 1);
Some additional things to consider:
Upvotes: 1
Reputation: 196
I'm not an expert on SQL Server, but I would seriously consider setting this up as a partitioned table. That would also make archiving easier, since old partitions can simply be switched out or dropped rather than removed with an expensive DELETE ... WHERE.
With a bit of luck, the optimiser will also read only the partitions that can contain the requested data (partition elimination).
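A minimal sketch of what that could look like, assuming SQL Server 2005+ and monthly partitions on the time column; all object names and boundary dates here are illustrative, not from the question:

```sql
-- Monthly range partitioning on [time] (boundaries are examples only).
CREATE PARTITION FUNCTION pf_tag_values_month (DATETIME)
AS RANGE RIGHT FOR VALUES ('2013-11-01', '2013-12-01', '2014-01-01');

CREATE PARTITION SCHEME ps_tag_values_month
AS PARTITION pf_tag_values_month ALL TO ([PRIMARY]);

-- Rebuild the table's clustered index on the partition scheme:
CREATE CLUSTERED INDEX cix_tag_values_time
ON tag_values (time) ON ps_tag_values_month (time);

-- Archiving a month then becomes a fast metadata operation, e.g.:
-- ALTER TABLE tag_values SWITCH PARTITION 1 TO tag_values_archive;
```

Queries with a time range predicate, like the reporting query above, would then touch only the relevant partitions.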
Upvotes: 0