Reputation: 1285
I'm pretty new to writing SQL and have just built a couple of procedures to add data to my MySQL database. The problem is that they are extremely slow, due to the large number of queries. What I do now is loop through each record in a table containing the unsorted, raw data, take that data point, and add it into the database. This becomes complicated because I have a number of FKs to deal with.
Can you please help me optimize this?
As an example, to add the specified table I do: CALL add_table1(112,15);
Procedure to add data
-- Primary procedure
CREATE PROCEDURE `add_table1`(
IN c_id INT UNSIGNED,
IN t_id INT UNSIGNED
)
BEGIN
-- Table variables
DECLARE r_id INT UNSIGNED;
DECLARE dh_name VARCHAR(50);
DECLARE d_value DECIMAL(20,10);
-- Loop variables
DECLARE done BOOLEAN DEFAULT FALSE;
-- Cursor for measurement table
DECLARE m_cur CURSOR FOR
SELECT Run_ID, DataHeader_Name, Data_Value
FROM `measurements`.`measurement_20131029_152902`;
-- Handlers for exceptions
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
-- Set start time
UPDATE `measurements`.`queue`
SET Start_Time = NOW()
WHERE Experiment_ID = c_id AND Procedure_Name = 'add_table1';
-- Loop through measurement table
OPEN m_cur;
m_loop: LOOP
FETCH m_cur INTO r_id, dh_name, d_value;
IF done THEN
CLOSE m_cur;
LEAVE m_loop;
END IF;
CALL add_measurement(dh_name, d_value, t_id, c_id, r_id);
END LOOP m_loop;
END
Procedure to add measurement
-- Secondary procedure, called from add_table1
CREATE PROCEDURE `add_measurement`(
IN measurement_header VARCHAR(50),
IN measurement_value DECIMAL(20,10),
IN tool_id_var INT UNSIGNED,
IN config_id_var INT UNSIGNED,
IN run_id_var INT UNSIGNED
)
BEGIN
-- Variables representing FKs
DECLARE data_header_id INT UNSIGNED;
DECLARE tool_header_link_id INT UNSIGNED;
DECLARE tool_data_id INT UNSIGNED;
DECLARE tool_data_link_id INT UNSIGNED;
-- Add header
INSERT IGNORE INTO data_headers(DataHeader_Name)
VALUES(measurement_header);
SET data_header_id = (SELECT DataHeader_ID
FROM data_headers WHERE DataHeader_Name = measurement_header);
-- Link header to tool
INSERT IGNORE INTO tool_header_link(DataHeader_ID, Tool_ID)
VALUES(data_header_id, tool_id_var);
SET tool_header_link_id = (SELECT ToolHeaderLink_ID
FROM tool_header_link
WHERE DataHeader_ID = data_header_id AND Tool_ID = tool_id_var);
-- Add measurement
INSERT IGNORE INTO tool_data(Data_Value) VALUES(measurement_value);
SET tool_data_id = (SELECT ToolData_ID
FROM tool_data WHERE Data_Value = measurement_value);
-- Link measurement to header and configuration
INSERT IGNORE INTO
tool_data_link(ToolHeaderLink_ID, ToolData_ID, Run_ID)
VALUES(tool_header_link_id, tool_data_id, run_id_var);
SET tool_data_link_id = (SELECT ToolDataLink_ID FROM tool_data_link
WHERE ToolHeaderLink_ID = tool_header_link_id
AND ToolData_ID = tool_data_id AND Run_ID = run_id_var);
-- Link measurement to experiment configuration
INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
VALUES(tool_data_link_id, config_id_var);
END
Current Solution
I stumbled upon this solution to a similar issue. I enclosed the meat of the code inside a TRANSACTION
and immediately noticed a massive improvement in speed. Instead of the estimated completion time of about 36 hours, the actual completion time dropped to about 5 minutes! I also made a slight design change to the database and removed an unnecessary FK. If anyone sees further ways to improve this code, I am still interested. I have brought the performance into an acceptable range for our applications, but I am always interested in making things better.
To show the changes:
START TRANSACTION;
-- Loop through measurement table
OPEN m_cur;
m_loop: LOOP
FETCH m_cur INTO r_id, dh_name, d_value;
IF done THEN
CLOSE m_cur;
LEAVE m_loop;
END IF;
CALL add_measurement(dh_name, d_value, t_id, c_id, r_id);
END LOOP m_loop;
COMMIT;
Alternative Solution
Based on the answers below, I was able to update my solution to the one shown here. From my testing, this new solution appears to function as desired. It is also more than twice as fast as the previous solution: using this routine, I can add one million unique pieces of data in about 2.5 minutes!
Thank you all for your help!
CREATE PROCEDURE `add_table`(
IN config_id_var INT UNSIGNED
)
BEGIN
START TRANSACTION;
-- Add header
INSERT IGNORE INTO data_headers(DataHeader_Name)
SELECT DataHeader_Name
FROM `measurements`.`measurement_20131114_142402`;
-- Add measurement
INSERT IGNORE INTO tool_data(Data_Value)
SELECT Data_Value
FROM `measurements`.`measurement_20131114_142402`;
-- Link measurement to header and configuration
-- INSERT Non-Unique Values
INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
FROM `measurements`.`measurement_20131114_142402` AS m
JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
JOIN tool_data AS d ON d.Data_Value = m.Data_Value;
-- INSERT Unique Values
INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
FROM `measurements`.`measurement_20131114_142402` AS m
LEFT OUTER JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
LEFT OUTER JOIN tool_data AS d ON d.Data_Value = m.Data_Value
WHERE ((h.DataHeader_Name IS NULL) OR (d.Data_Value IS NULL));
-- Link measurement to experiment configuration
-- INSERT Non-Unique Values
INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
SELECT tdl.ToolDataLink_ID, config_id_var
FROM tool_data_link AS tdl
JOIN data_headers AS h ON h.DataHeader_ID = tdl.DataHeader_ID
JOIN tool_data AS d ON d.ToolData_ID = tdl.ToolData_ID;
-- INSERT Unique Values
INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
SELECT tdl.ToolDataLink_ID, config_id_var
FROM tool_data_link AS tdl
LEFT OUTER JOIN data_headers AS h ON h.DataHeader_ID = tdl.DataHeader_ID
LEFT OUTER JOIN tool_data AS d ON d.ToolData_ID = tdl.ToolData_ID
WHERE ((h.DataHeader_ID IS NULL) OR (d.ToolData_ID IS NULL));
COMMIT;
END
Conclusion
I did some more testing with the solution that did not use cursors. It is definitely faster initially; however, as the database grows, the execution time increases drastically.
I added a couple million data points to the database, then tried adding a small data set of a few hundred points. It took nearly 400x longer than the cursor solution. I believe that is because the cursors only touched the data points needed, whereas the joins had to scan through all of the data.
Based on those results, it appears that the cursor solution will be better for my applications.
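One hypothetical way to keep the set-based approach from scanning the whole database as it grows is to drive every insert from the small staging table rather than from the large link tables. The sketch below is untested and reuses the table and column names from the question; restricting the final `tool_link` insert through the staging table is my assumption, not part of the original routine:

```sql
-- Sketch: derive the tool_link rows from the staging table only,
-- so the join cost scales with the new batch, not the whole database.
INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
SELECT tdl.ToolDataLink_ID, 112  -- 112 = example config id for this batch
FROM `measurements`.`measurement_20131114_142402` AS m
JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
JOIN tool_data AS d ON d.Data_Value = m.Data_Value
JOIN tool_data_link AS tdl
  ON tdl.DataHeader_ID = h.DataHeader_ID
 AND tdl.ToolData_ID = d.ToolData_ID
 AND tdl.Run_ID = m.Run_ID;
```

With suitable indexes on the join columns, each lookup is then proportional to the size of the new measurement batch rather than the accumulated history.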
Upvotes: 4
Views: 381
Reputation:
Just as an example, take this statement from your question:
-- Link measurement to header and configuration
INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
SELECT t1.DataHeader_ID, t2.ToolData_ID, t3.Run_ID
FROM data_headers t1, tool_data t2, runs t3
WHERE (( t1.DataHeader_Name in (
SELECT DataHeader_Name
FROM `measurements`.`measurement_20131114_110059`)
) AND ( t2.Data_Value in (
SELECT Data_Value
FROM `measurements`.`measurement_20131114_110059`)
) AND ( t3.Run_ID in (
SELECT Run_ID
FROM `measurements`.`measurement_20131114_110059`)));
Compare the performance of just that SELECT
with the following:
SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
FROM measurements.measurement_20131114_110059 AS m
JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
JOIN tool_data AS d ON d.Data_Value = m.Data_Value
;
Note: You would absolutely need indexes on DataHeader_Name and Data_Value in both tables that have them. (Note: joining with a temporary table is pretty much always slow and can never be properly indexed)
The bottom line here is that all your data is coming out of measurement_*. Do as little as possible with other tables if you want performance.
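For instance, using the table and column names from the question, the supporting indexes might look like the following sketch (index names are illustrative; adjust prefix lengths and uniqueness to your schema):

```sql
-- Indexes to support the join predicates above.
-- Without them, each join degenerates into a full scan of the lookup table.
CREATE INDEX idx_dh_name  ON data_headers (DataHeader_Name);
CREATE INDEX idx_td_value ON tool_data (Data_Value);
```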
Upvotes: 1
Reputation: 131
Databases are designed for set-based logic. Try writing the same queries without cursors.
What your code does:
1. Reads the records from the measurement_xxxx table and, for each one, executes 5 inserts.
How this can be written with set logic:
1. Make the first insert for all the records from measurement_XXXX.
2. Make the second insert for all the records from measurement_XXXX.
.......
This will change the code to something like the following (I did not test the code, so it is a guideline):
CREATE PROCEDURE `using_set_logic`(
IN c_id INT UNSIGNED,
IN t_id INT UNSIGNED
)
begin
-- the first insert
insert into data_headers(DataHeader_Name)
select DataHeader_Name from `measurements`.`measurement_20131029_152902`;
-- the second insert if the DataHeader_Name is unique
insert into tool_header_link(DataHeader_ID, Tool_ID)
select DataHeader_ID, t_id from data_headers where DataHeader_Name in (
select DataHeader_Name from `measurements`.`measurement_20131029_152902`
);
-- the second insert if the DataHeader_Name is not unique:
-- take the last id for data_headers before the first insert,
-- take the last id for data_headers after the insert,
-- and use those values to bound the DataHeader_ID.
insert into tool_header_link(DataHeader_ID, Tool_ID)
select DataHeader_ID, t_id from data_headers
where DataHeader_ID between @beforeFirstInsert and @afterFirstInsert;
end
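The "last id before/after the insert" bounds mentioned above can be captured with user variables. A rough, untested sketch, again using the table names from the question:

```sql
-- Capture the id range produced by the first insert.
SET @beforeFirstInsert = (SELECT IFNULL(MAX(DataHeader_ID), 0) FROM data_headers);

INSERT INTO data_headers(DataHeader_Name)
SELECT DataHeader_Name FROM `measurements`.`measurement_20131029_152902`;

SET @afterFirstInsert = (SELECT MAX(DataHeader_ID) FROM data_headers);

-- Rows with DataHeader_ID above @beforeFirstInsert and up to
-- @afterFirstInsert are exactly the ones added by the insert above.
```

Note this only identifies the new rows reliably if no concurrent session inserts into data_headers between the two snapshots, which is another reason to wrap the whole routine in a transaction.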
Upvotes: 2