Justin Eckhart

Reputation: 45

Optimize MySQL query that scans large amounts of data

So I have an error log that I need to analyze. The log is a huge file with around 2.5 million rows.

In that error log there are fields called:

EVENT_ATTRIBUTE, which holds the name of the device that collected the information.

EVENT_SEVERITY, which holds a number from 1 to 5. In this column I need to count the number of 1's, 2's, 4's and 5's.

This is a link to a screenshot of the table I'm using:

https://i.sstatic.net/Yk4dq.png

I basically need to get the event_attribute and count the occurrences of each event_severity value in separate columns. The counts of 1's, 2's, 4's and 5's will be split per event_attribute, so I can see how many errors of each type each sensor (event_attribute) picks up.

Currently this is the code I have:

    SELECT LEFT(EVENT_ATTRIBUTE, LOCATE('(', EVENT_ATTRIBUTE, 1)-1) AS SensorName,
           SUM(EVENT_SEVERITY = 1) CODE_1,
           SUM(EVENT_SEVERITY = 2) CODE_2,
           SUM(EVENT_SEVERITY = 4) ERROR_4,
           SUM(EVENT_SEVERITY = 5) ERROR_5
    FROM taddmapp.disc_event
    WHERE EVENT_SEVERITY = 5 OR EVENT_SEVERITY = 4 OR EVENT_SEVERITY = 2 OR
          EVENT_SEVERITY = 1
    GROUP BY LEFT(EVENT_ATTRIBUTE, LOCATE('(', EVENT_ATTRIBUTE, 1)-1);

The LEFT(EVENT_ATTRIBUTE, LOCATE('(', EVENT_ATTRIBUTE, 1)-1) part of the code just drops the IP address of the sensor being used, and the other SELECT expressions sum the counts of the status codes. The query brings back the results I need, but the only problem is that it takes way too long: I ran it the other day and it took over an hour to execute. I was wondering if there is any way I could optimize this query. I am not too good at SQL, so I need help optimizing.
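
For example, with a made-up value like 'sensor01(10.0.0.5)', the expression keeps only the part before the parenthesis:

    SELECT LEFT('sensor01(10.0.0.5)', LOCATE('(', 'sensor01(10.0.0.5)', 1) - 1);
    -- returns 'sensor01'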

Thank you for your time!

Upvotes: 2

Views: 65

Answers (3)

spencer7593

Reputation: 108430

If adding a column to the table isn't an option, then we could try rewriting the query to make effective use of an index that has EVENT_ATTRIBUTE as the leading column. With an index like this:

    ... ON taddmapp.disc_event (EVENT_ATTRIBUTE,EVENT_SEVERITY)

we've got a shot at getting MySQL to use that index and avoid a "Using filesort" operation to satisfy the GROUP BY in the inner query.
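
For reference, a complete statement to create such an index might look like this (the index name here is just a placeholder):

    CREATE INDEX disc_event_ix_attr_sev
        ON taddmapp.disc_event (EVENT_ATTRIBUTE, EVENT_SEVERITY);

The rewritten query then becomes: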

    SELECT SUBSTRING_INDEX(t.EVENT_ATTRIBUTE,'(',1) AS SensorName
         , SUM(t.CODE_1) AS CODE_1
         , SUM(t.CODE_2) AS CODE_2
         , SUM(t.ERROR_4) AS ERROR_4
         , SUM(t.ERROR_5) AS ERROR_5
      FROM ( SELECT e.EVENT_ATTRIBUTE
                  , SUM(e.EVENT_SEVERITY = 1) AS CODE_1
                  , SUM(e.EVENT_SEVERITY = 2) AS CODE_2
                  , SUM(e.EVENT_SEVERITY = 4) AS ERROR_4
                  , SUM(e.EVENT_SEVERITY = 5) AS ERROR_5
               FROM taddmapp.disc_event e
              WHERE e.EVENT_SEVERITY IN (1,2,4,5)
              GROUP BY e.EVENT_ATTRIBUTE
           ) t
     GROUP
        BY SUBSTRING_INDEX(t.EVENT_ATTRIBUTE,'(',1)

NOTE: The GROUP BY on the outer query would still require a "Using filesort" operation, but the goal here is to have that outer query operate on a much smaller set of rows (assuming that the GROUP BY in the inline view collapses those 2.5 million rows into a much more reasonably sized set.)
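
One way to check this, assuming the index above is in place, is to run EXPLAIN against a simplified stand-in for the inline view and confirm that the Extra column no longer reports "Using filesort":

    EXPLAIN
    SELECT e.EVENT_ATTRIBUTE, COUNT(*)
      FROM taddmapp.disc_event e
     WHERE e.EVENT_SEVERITY IN (1,2,4,5)
     GROUP BY e.EVENT_ATTRIBUTE;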


If we had a separate column, e.g. EVENT_ATTRIBUTE_PREFIX, populated with just the leading portion of EVENT_ATTRIBUTE we are interested in, we could avoid the inline view entirely. Assuming we had that column, and it was the leading column in an index, e.g.

    ... ON taddmapp.disc_event (EVENT_ATTRIBUTE_PREFIX,EVENT_SEVERITY)

Then MySQL could make use of that index to satisfy the GROUP BY operation, without requiring a "Using filesort" operation, with a query like this:

    SELECT e.EVENT_ATTRIBUTE_PREFIX  AS SensorName
         , SUM(e.EVENT_SEVERITY = 1) AS CODE_1
         , SUM(e.EVENT_SEVERITY = 2) AS CODE_2
         , SUM(e.EVENT_SEVERITY = 4) AS ERROR_4
         , SUM(e.EVENT_SEVERITY = 5) AS ERROR_5
      FROM taddmapp.disc_event e
     WHERE e.EVENT_SEVERITY IN (1,2,4,5)
     GROUP BY e.EVENT_ATTRIBUTE_PREFIX
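
For completeness, here is a minimal sketch of one way to create and index such a column, assuming a MySQL version (5.7+) that supports stored generated columns; the column length and index name are placeholders:

    ALTER TABLE taddmapp.disc_event
      ADD COLUMN EVENT_ATTRIBUTE_PREFIX VARCHAR(64)
        AS (SUBSTRING_INDEX(EVENT_ATTRIBUTE, '(', 1)) STORED;

    CREATE INDEX disc_event_ix_prefix_sev
        ON taddmapp.disc_event (EVENT_ATTRIBUTE_PREFIX, EVENT_SEVERITY);

A stored generated column keeps itself in sync with EVENT_ATTRIBUTE automatically, so no backfill query or trigger is needed.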

Upvotes: 1

Gordon Linoff

Reputation: 1270021

Simplify the WHERE clause:

    SELECT LEFT(EVENT_ATTRIBUTE, LOCATE('(', EVENT_ATTRIBUTE, 1)-1) AS SensorName,
           SUM(EVENT_SEVERITY = 1) AS CODE_1,
           SUM(EVENT_SEVERITY = 2) AS CODE_2,
           SUM(EVENT_SEVERITY = 4) AS ERROR_4,
           SUM(EVENT_SEVERITY = 5) AS ERROR_5
    FROM taddmapp.disc_event
    WHERE EVENT_SEVERITY IN (1, 2, 4, 5)
    GROUP BY LEFT(EVENT_ATTRIBUTE, LOCATE('(', EVENT_ATTRIBUTE, 1)-1);

You can improve the performance with an index on disc_event(event_severity, event_attribute).
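
For example (the index name is a placeholder; putting event_severity first lets the IN (1, 2, 4, 5) predicate read just the matching index ranges):

    CREATE INDEX disc_event_ix_sev_attr
        ON taddmapp.disc_event (EVENT_SEVERITY, EVENT_ATTRIBUTE);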

Beyond that, there is not much you can do, because of the GROUP BY on a computed expression.

Upvotes: 1

Uueerdo

Reputation: 15951

You're calling a couple of string functions (LEFT() and LOCATE()) on every row in the table; that will definitely impact performance. If you could add another field to the table to permanently hold the substring you are extracting, that field could be used instead, removing the need to extract it on every such query; it would also let you index the field for even more performance benefit.

If you make the new field nullable with a default of NULL, you could run a query to backfill existing rows with the extracted value; better still, create a BEFORE INSERT trigger that calculates the field value as rows are inserted, as in the sketch below.
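
A minimal sketch of that approach, assuming a new column named SENSOR_NAME (the column name, its length, and the trigger name are all placeholders):

    -- Add the new nullable field.
    ALTER TABLE taddmapp.disc_event
      ADD COLUMN SENSOR_NAME VARCHAR(64) NULL DEFAULT NULL;

    -- Backfill existing rows using the same expression as the original query.
    UPDATE taddmapp.disc_event
       SET SENSOR_NAME = LEFT(EVENT_ATTRIBUTE, LOCATE('(', EVENT_ATTRIBUTE, 1) - 1)
     WHERE SENSOR_NAME IS NULL;

    -- Keep the field populated for new rows.
    CREATE TRIGGER disc_event_bi BEFORE INSERT ON taddmapp.disc_event
    FOR EACH ROW
      SET NEW.SENSOR_NAME = LEFT(NEW.EVENT_ATTRIBUTE, LOCATE('(', NEW.EVENT_ATTRIBUTE, 1) - 1);

With the new field indexed (e.g. on (SENSOR_NAME, EVENT_SEVERITY)), the GROUP BY can use the index instead of computing the expression for every row.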

Upvotes: 1
