hedbisker
hedbisker

Reputation: 327

Optimize SQL Query (If possible) using CONVERT(INT, SUBSTRING( and LEN FUNCTION

My situation is like that :

I have these tables:

CREATE TABLE [dbo].[HeaderResultPulser] 
(
    [Id]                  BIGINT     IDENTITY (1, 1) NOT NULL,
    [ReportNumber]        CHAR(255)  NOT NULL,
    [ReportDescription]   CHAR(255)  NOT NULL,
    [CatalogNumber]       NCHAR(255) NOT NULL,
    [WorkerName]          NCHAR(255) DEFAULT ('') NOT NULL,
    [LastCalibrationDate] DATETIME   NOT NULL,
    [NextCalibrationDate] DATETIME   NOT NULL,
    [MachineNumber]       INT        NOT NULL,
    [EditTime]            DATETIME   NOT NULL,
    [Age]                 NCHAR(255) DEFAULT ((1)) NOT NULL,
    [Current]             INT        DEFAULT ((-1)) NOT NULL,
    [Time]                BIGINT     DEFAULT ((-1)) NOT NULL,
    [MachineName]         NVARCHAR(MAX) DEFAULT ('') NOT NULL,
    [BatchNumber]         NVARCHAR(MAX) DEFAULT ('') NOT NULL,

    CONSTRAINT [PK_HeaderResultPulser] 
        PRIMARY KEY CLUSTERED ([Id] ASC)
);

CREATE TABLE [dbo].[ResultPulser] 
(
    [Id]           BIGINT    IDENTITY (1, 1) NOT NULL,
    [ReportNumber] CHAR(255) NOT NULL,
    [BatchNumber]  CHAR(255) NOT NULL,
    [DateTime]     DATETIME  NOT NULL,
    [Ocv]          FLOAT(53) NOT NULL,
    [OcvMin]       FLOAT(53) NOT NULL,
    [OcvMax]       FLOAT(53) NOT NULL,
    [Ccv]          FLOAT(53) NOT NULL,
    [CcvMin]       FLOAT(53) NOT NULL,
    [CcvMax]       FLOAT(53) NOT NULL,
    [Delta]        BIGINT    NOT NULL,
    [DeltaMin]     BIGINT    NOT NULL,
    [DeltaMax]     BIGINT    NOT NULL,
    [CurrentFail]  BIT       DEFAULT ((0)) NOT NULL,
    [NumberInTest] INT       NOT NULL
);

For every row in HeaderResultPulser I have multiple rows in ResultPulser my key is the [HeaderResultPulser].[ReportNumber] to get a list of data in ResultPulser, and for every a lot of row with the same [ResultPulser].[ReportNumber]

It has multiple [ResultPulser].[NumberInTest] values

For example: in the ResultPulser table the data can look like this:

ReportNumber | NumberInTest
-------------+-------------
 0000006211  |      1
 0000006211  |      2
 0000006211  |      3
 0000006211  |      4
 0000006211  |      5
 0000006211  |      6
 0000006212  |      1
 0000006212  |      2
 0000006212  |      3
 0000006212  |      4
 0000006212  |      5

NumberInTest can be 200, 500, 10000 and sometime even more..

The report number column contains two the first 7 chars are a number of machine and the rest is an incrementing number.

For example, 0000006212 is [0000006][212] == [the machine number][the incrementing number]

My query for example :

select 
    [HeaderResultPulser].[ReportNumber],
    max(NumberInTest) as TotalCells
from 
    ResultPulser, HeaderResultPulser 
where 
    ((([ResultPulser].[ReportNumber] like '0000006%' and
       CONVERT(INT, SUBSTRING([ResultPulser].[ReportNumber], 8,    LEN([ResultPulser].[ReportNumber]))) BETWEEN '211' AND '815') 
      and ([HeaderResultPulser].[ReportNumber] = [ResultPulser].[ReportNumber])))
group by 
    [HeaderResultPulser].[ReportNumber]

Actually I want to get all the rows on the machine number 0000006 that number was 211 to 815 (include both)

This query takes about 6-7 seconds

There is a lot of data (in the hundreds of millions and billions and in the future can be more and can be much more in table ResultPulser), and it can get Tens of thousands of rows in HeaderResultPulser table

And In getting receive I only receive on select a few hundred in the worst case a thousand or about two thousand if I want to go far... but (in numbers) to get the max(NumberInTest) from ResultPulser I take about (It can get to a few millions of rows)

There is any way to optimize my query? Or when It's so much data it's just must this time? (That just the way it is)

Upvotes: 0

Views: 317

Answers (4)

Gienek
Gienek

Reputation: 31

If you really want to optimize performance, I propose to add a bit of logic beyond SQL structures. Is it possible that particular value of ReportNumber is present in table ResultPulser, but not in table HeaderResultPulser? If not, and I ssupose so, there is no reason to join table HeaderResultPulser. Then, I propose to take advantage from fact, that the condition on ReportNumber can be expressed equivalently without dividing in substrings. For your example, the condition

([ResultPulser].[ReportNumber] like '0000006%' and
 CONVERT(INT, SUBSTRING([ResultPulser].[ReportNumber], 8,
  LEN([ResultPulser].[ReportNumber]))) BETWEEN '211' AND '815') 

is equivalent to:

([ResultPulser].[ReportNumber] BETWEEN '0000006211' and '0000006815')

So the proposal is:

  1. Create index on table ResultPulser(ReportNumber, NumberInTest)
  2. Use selections similar to this:
select ReportNumber, max(NumberInTest) as TotalCells
from ResultPulser
where 
    ReportNumber BETWEEN '0000006211' and '0000006815'
group by 
    ReportNumber

(Please, add brackets or double quotes and capitalizations as necessary for MS SQL Server and your taste)

I would expect that good database will execute this query by index-only access, and it will be optimal from execution point of view. Performance depends on not only on execution path, but also on setup and hardware. Please, make sure that your database has enough cache and fast disk accesses. Also concurrent load is very important.

Simple splitting the field ReportNumber into [the machine number] and [the incrementing number] will probably not improve performance of the query in form proposed by me. But it may be very convenient for other forms of access (other WHERE classes). And it will reflect the structure of the case. Even more important: It will release you from imposed limits. Currently, you have 3 digits for the [the incrementing number]. Are you sure, it will never be necessary to have more than 999 of them for single [the machine number]?

Why the field ReportNumber has type char(255), when only 10 characters are used? char(255) has fixed length, so it will be terrible wasting of space. Only database compression can help. Used space has strong influence on performance – Please, consider the above remark about the database cache.

If both these fields, [the machine number], [the incrementing number], are intergers, why not split ReportNumber and use integer type for them?

Side remark: Field names suggest that you search the total number of rows in table ResultPulser, which belong to single entry in table HeaderResultPulser. The proposed query will deliver this, only if numbers in NumberInTest are consecutive, without gaps. If this is not supplied, you have to count them rather than seek the maximum.

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1270623

The biggest performance gain might be eliminating the outer group by by using a correlated subquery or lateral join:

select hrp.[ReportNumber],
       (select max(rp.NumberInTest)
        from ResultPulser rp
        where rp.ReportNumber = hrp.ReportNumber and
               right(rp.ReportNumber, 3) between '211' and '815'
       ) as TotalCells
from HeaderResultPulser hrp
where hrp.ReportNumber like '0000006%';

Your logic looks like it only wants the last three characters of the ReportNumber, so I simplified the logic. I'm not 100% that is the case -- it just seems reasonable. Regardless, there is no need to convert the values to integers and then compare as strings. And similar logic can be used even for longer report numbers.

You also want an index on ResultPulser(ReportNumber, NumberInTest) :

create index idx_resultpulser_reportnumber_numberintest on ResultPulser(ReportNumber, NumberInTest)

EDIT:

Actually, I notice that the report number matches between the two tables. So this seems simplest:

select hrp.[ReportNumber],
       (select max(rp.NumberInTest)
        from ResultPulser rp
        where rp.ReportNumber = hrp.ReportNumber 
       ) as TotalCells
from HeaderResultPulser hrp
where hrp.ReportNumber >= '0000006211' and
      hrp.ReportNumber <= '0000006815';

You still want to be sure you have the above index on ResultPulser.

If the ReportNumber is not a fixed 10 digits, then you can use:

where hrp.ReportNumber >= '0000006211' and
      hrp.ReportNumber <= '0000006815' and
      len(hrp.ReportNumber) = 10

This should also use the index and return exactly what you want.

Upvotes: 1

Akhilesh
Akhilesh

Reputation: 196

Performance Optimization of any query depends on many factors including environment you are hosting and running your query. Hardware and Software play important part in optimization of heavy running database queries. In your case you can look into following things:

  1. USE ANSI 92 JOIN syntax instead of default cross join e.g select * from T1 join T2 on T1.column = T2.column

  2. Put indexes on columns like [ReportNumber] [NumberInTest] Note: You may need index for each column in the join area which is not primary key.

  3. Remember use of MAX is always heavy and that could be the main problem in your query.

  4. Finally you can further look into optimizing your query syntax using following online tool where you can specify your actual query and environment you are using:

https://www.eversql.com/

Hope it help you.

Upvotes: 0

allmhuran
allmhuran

Reputation: 4464

The way you are doing joins is no longer standard. It's also hard to read, and dangerous if you ever need to use left joins. Instead of joining this way:

select * 
from   T1, T2 
where  T1.column = T2.column

Use ANSI-92 join syntax instead:

select * 
from   T1 
join   T2 on T1.column = T2.column

You said that your "key" was ReportNumber. Why isn't that declared in your schema? It sounds like you want a unique constraint on HeaderResultPulser.ReportNumber, and a foreign key on the the ReportPulser table, such that ReportNumber references HeaderResultPulser (ReportNumber)

Since your report number column seems to contain two different values, your table is not in First Normal Form. This is making things difficult for you. Why not split the two parts of the "report number" into two different columns when the data is entered? This will significantly improve your query performance, because you no longer need to perform an expression against the data in the table at query time to separate the ReportNumber into atomic values.

Your comment says that the first 7 characters of the ReportNumber are the MachineNumber. But you already have MachineNumber in the HeaderReportPulser table. So why not just add a separate column for Increment? If you still need ReportNumber to exist as a column, you can make it a calculated column, as the concatenation of MachineNumber and Increment.

If you don't want to touch the "existing" schema, we can do a similar thing in reverse. Your query will not be completely sargable unless you can do something to the schema, because you have to perform some kind of expression on the data in the ReportNumber column. But maybe you have the option to use a calculated column to do this up front:

alter table HeaderReportPulser 
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7);

Now we have the increment as a column in its own right. But it's still being calculated at query time, because it's not persisted. We can make it persisted:

alter table HeaderReportPulser 
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7) persisted;

We can also index a computed column. Since your required expression is deterministic and precise (see Indexes on Computed Columns), we don't actually have to mark it as persisted:

alter table HeaderReportPulser 
add Increment as right(ReportNumber, len(rtrim(ReportNumber)) - 7);
create index ix_headerreportpulser_increment on HeaderReportPulser(Increment);

You could do a similar set of operations to create the Increment and MachineNumber on the ReportPulser table. If you always want to use both values, create an index on the combination of (MachineNumber, Increment)

Upvotes: 3

Related Questions