Reputation: 327
My situation is like that :
I have these tables:
CREATE TABLE [dbo].[HeaderResultPulser]
(
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[ReportNumber] CHAR(255) NOT NULL,
[ReportDescription] CHAR(255) NOT NULL,
[CatalogNumber] NCHAR(255) NOT NULL,
[WorkerName] NCHAR(255) DEFAULT ('') NOT NULL,
[LastCalibrationDate] DATETIME NOT NULL,
[NextCalibrationDate] DATETIME NOT NULL,
[MachineNumber] INT NOT NULL,
[EditTime] DATETIME NOT NULL,
[Age] NCHAR(255) DEFAULT ((1)) NOT NULL,
[Current] INT DEFAULT ((-1)) NOT NULL,
[Time] BIGINT DEFAULT ((-1)) NOT NULL,
[MachineName] NVARCHAR(MAX) DEFAULT ('') NOT NULL,
[BatchNumber] NVARCHAR(MAX) DEFAULT ('') NOT NULL,
CONSTRAINT [PK_HeaderResultPulser]
PRIMARY KEY CLUSTERED ([Id] ASC)
);
CREATE TABLE [dbo].[ResultPulser]
(
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[ReportNumber] CHAR(255) NOT NULL,
[BatchNumber] CHAR(255) NOT NULL,
[DateTime] DATETIME NOT NULL,
[Ocv] FLOAT(53) NOT NULL,
[OcvMin] FLOAT(53) NOT NULL,
[OcvMax] FLOAT(53) NOT NULL,
[Ccv] FLOAT(53) NOT NULL,
[CcvMin] FLOAT(53) NOT NULL,
[CcvMax] FLOAT(53) NOT NULL,
[Delta] BIGINT NOT NULL,
[DeltaMin] BIGINT NOT NULL,
[DeltaMax] BIGINT NOT NULL,
[CurrentFail] BIT DEFAULT ((0)) NOT NULL,
[NumberInTest] INT NOT NULL
);
For every row in HeaderResultPulser
I have multiple rows in ResultPulser
my key is the [HeaderResultPulser].[ReportNumber] to get a list of data in ResultPulser
, and for every a lot of row with the same [ResultPulser].[ReportNumber]
It has multiple [ResultPulser].[NumberInTest]
values
For example: in the ResultPulser
table the data can look like this:
ReportNumber | NumberInTest
-------------+-------------
0000006211 | 1
0000006211 | 2
0000006211 | 3
0000006211 | 4
0000006211 | 5
0000006211 | 6
0000006212 | 1
0000006212 | 2
0000006212 | 3
0000006212 | 4
0000006212 | 5
NumberInTest
can be 200, 500, 10000 and sometime even more..
The report number column contains two the first 7 chars are a number of machine and the rest is an incrementing number.
For example, 0000006212 is [0000006][212] == [the machine number][the incrementing number]
My query for example :
select
[HeaderResultPulser].[ReportNumber],
max(NumberInTest) as TotalCells
from
ResultPulser, HeaderResultPulser
where
((([ResultPulser].[ReportNumber] like '0000006%' and
CONVERT(INT, SUBSTRING([ResultPulser].[ReportNumber], 8, LEN([ResultPulser].[ReportNumber]))) BETWEEN '211' AND '815')
and ([HeaderResultPulser].[ReportNumber] = [ResultPulser].[ReportNumber])))
group by
[HeaderResultPulser].[ReportNumber]
Actually I want to get all the rows on the machine number 0000006 that number was 211 to 815 (include both)
This query takes about 6-7 seconds
There is a lot of data (in the hundreds of millions and billions and in the future can be more and can be much more in table ResultPulser), and it can get Tens of thousands of rows in HeaderResultPulser table
And In getting receive I only receive on select a few hundred in the worst case a thousand or about two thousand if I want to go far... but (in numbers) to get the max(NumberInTest)
from ResultPulser
I take about (It can get to a few millions of rows)
There is any way to optimize my query? Or when It's so much data it's just must this time? (That just the way it is)
Upvotes: 0
Views: 317
Reputation: 31
If you really want to optimize performance, I propose to add a bit of logic beyond SQL structures. Is it possible that particular value of ReportNumber is present in table ResultPulser, but not in table HeaderResultPulser? If not, and I ssupose so, there is no reason to join table HeaderResultPulser. Then, I propose to take advantage from fact, that the condition on ReportNumber can be expressed equivalently without dividing in substrings. For your example, the condition
([ResultPulser].[ReportNumber] like '0000006%' and
CONVERT(INT, SUBSTRING([ResultPulser].[ReportNumber], 8,
LEN([ResultPulser].[ReportNumber]))) BETWEEN '211' AND '815')
is equivalent to:
([ResultPulser].[ReportNumber] BETWEEN '0000006211' and '0000006815')
So the proposal is:
select ReportNumber, max(NumberInTest) as TotalCells
from ResultPulser
where
ReportNumber BETWEEN '0000006211' and '0000006815'
group by
ReportNumber
(Please, add brackets or double quotes and capitalizations as necessary for MS SQL Server and your taste)
I would expect that good database will execute this query by index-only access, and it will be optimal from execution point of view. Performance depends on not only on execution path, but also on setup and hardware. Please, make sure that your database has enough cache and fast disk accesses. Also concurrent load is very important.
Simple splitting the field ReportNumber into [the machine number] and [the incrementing number] will probably not improve performance of the query in form proposed by me. But it may be very convenient for other forms of access (other WHERE classes). And it will reflect the structure of the case. Even more important: It will release you from imposed limits. Currently, you have 3 digits for the [the incrementing number]. Are you sure, it will never be necessary to have more than 999 of them for single [the machine number]?
Why the field ReportNumber has type char(255), when only 10 characters are used? char(255) has fixed length, so it will be terrible wasting of space. Only database compression can help. Used space has strong influence on performance – Please, consider the above remark about the database cache.
If both these fields, [the machine number], [the incrementing number], are intergers, why not split ReportNumber and use integer type for them?
Side remark: Field names suggest that you search the total number of rows in table ResultPulser, which belong to single entry in table HeaderResultPulser. The proposed query will deliver this, only if numbers in NumberInTest are consecutive, without gaps. If this is not supplied, you have to count them rather than seek the maximum.
Upvotes: 0
Reputation: 1270623
The biggest performance gain might be eliminating the outer group by
by using a correlated subquery or lateral join:
select hrp.[ReportNumber],
(select max(rp.NumberInTest)
from ResultPulser rp
where rp.ReportNumber = hrp.ReportNumber and
right(rp.ReportNumber, 3) between '211' and '815'
) as TotalCells
from HeaderResultPulser hrp
where hrp.ReportNumber like '0000006%';
Your logic looks like it only wants the last three characters of the ReportNumber
, so I simplified the logic. I'm not 100% that is the case -- it just seems reasonable. Regardless, there is no need to convert the values to integers and then compare as strings. And similar logic can be used even for longer report numbers.
You also want an index on ResultPulser(ReportNumber, NumberInTest)
:
create index idx_resultpulser_reportnumber_numberintest on ResultPulser(ReportNumber, NumberInTest)
EDIT:
Actually, I notice that the report number matches between the two tables. So this seems simplest:
select hrp.[ReportNumber],
(select max(rp.NumberInTest)
from ResultPulser rp
where rp.ReportNumber = hrp.ReportNumber
) as TotalCells
from HeaderResultPulser hrp
where hrp.ReportNumber >= '0000006211' and
hrp.ReportNumber <= '0000006815';
You still want to be sure you have the above index on ResultPulser
.
If the ReportNumber
is not a fixed 10 digits, then you can use:
where hrp.ReportNumber >= '0000006211' and
hrp.ReportNumber <= '0000006815' and
len(hrp.ReportNumber) = 10
This should also use the index and return exactly what you want.
Upvotes: 1
Reputation: 196
Performance Optimization of any query depends on many factors including environment you are hosting and running your query. Hardware and Software play important part in optimization of heavy running database queries. In your case you can look into following things:
USE ANSI 92 JOIN syntax instead of default cross join e.g select * from T1 join T2 on T1.column = T2.column
Put indexes on columns like [ReportNumber] [NumberInTest] Note: You may need index for each column in the join area which is not primary key.
Remember use of MAX is always heavy and that could be the main problem in your query.
Finally you can further look into optimizing your query syntax using following online tool where you can specify your actual query and environment you are using:
Hope it help you.
Upvotes: 0
Reputation: 4464
The way you are doing joins is no longer standard. It's also hard to read, and dangerous if you ever need to use left joins. Instead of joining this way:
select *
from T1, T2
where T1.column = T2.column
Use ANSI-92 join syntax instead:
select *
from T1
join T2 on T1.column = T2.column
You said that your "key" was ReportNumber
. Why isn't that declared in your schema? It sounds like you want a unique constraint on HeaderResultPulser.ReportNumber
, and a foreign key
on the the ReportPulser
table, such that ReportNumber references HeaderResultPulser (ReportNumber)
Since your report number column seems to contain two different values, your table is not in First Normal Form. This is making things difficult for you. Why not split the two parts of the "report number" into two different columns when the data is entered? This will significantly improve your query performance, because you no longer need to perform an expression against the data in the table at query time to separate the ReportNumber
into atomic values.
Your comment says that the first 7 characters of the ReportNumber
are the MachineNumber
. But you already have MachineNumber
in the HeaderReportPulser
table. So why not just add a separate column for Increment
? If you still need ReportNumber
to exist as a column, you can make it a calculated column, as the concatenation of MachineNumber
and Increment
.
If you don't want to touch the "existing" schema, we can do a similar thing in reverse. Your query will not be completely sargable unless you can do something to the schema, because you have to perform some kind of expression on the data in the ReportNumber
column. But maybe you have the option to use a calculated column to do this up front:
alter table HeaderReportPulser
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7);
Now we have the increment as a column in its own right. But it's still being calculated at query time, because it's not persisted. We can make it persisted:
alter table HeaderReportPulser
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7) persisted;
We can also index a computed column. Since your required expression is deterministic and precise (see Indexes on Computed Columns), we don't actually have to mark it as persisted:
alter table HeaderReportPulser
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7);
create index ix_headerreportpulser_increment on HeaderReportPulser(Increment);
You could do a similar set of operations to create the Increment
and MachineNumber
on the ReportPulser
table. If you always want to use both values, create an index on the combination of (MachineNumber, Increment)
Upvotes: 3