Reputation: 1285
I'm pretty new to writing SQL and have just built a couple of procedures to add data to my MySQL database. The problem is that they are extremely slow, due to the large number of queries. What I do now is loop through each record in a table containing the unsorted, raw data, take that data point, and add it into the database. This becomes complicated because I have a number of FKs to deal with.
Can you please help me optimize this?
As an example, to add the specified table I do: CALL add_table1(112,15);
Procedure to add data
-- Primary procedure
CREATE PROCEDURE `add_table1`(
IN c_id INT UNSIGNED,
IN t_id INT UNSIGNED
)
BEGIN
-- Table variables
DECLARE r_id INT UNSIGNED;
DECLARE dh_name VARCHAR(50);
DECLARE d_value DECIMAL(20,10);
-- Loop variables
DECLARE done BOOLEAN DEFAULT FALSE;
-- Cursor for measurement table
DECLARE m_cur CURSOR FOR
SELECT Run_ID, DataHeader_Name, Data_Value
FROM `measurements`.`measurement_20131029_152902`;
-- Handlers for exceptions
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
-- Set start time
UPDATE `measurements`.`queue`
SET Start_Time = NOW()
WHERE Experiment_ID = c_id AND Procedure_Name = 'add_table1';
-- Loop through measurement table
OPEN m_cur;
m_loop: LOOP
FETCH m_cur INTO r_id, dh_name, d_value;
IF done THEN
CLOSE m_cur;
LEAVE m_loop;
END IF;
CALL add_measurement(dh_name, d_value, t_id, c_id, r_id);
END LOOP m_loop;
END
Procedure to add measurement
-- Secondary procedure, called from add_table1
CREATE PROCEDURE `add_measurement`(
IN measurement_header VARCHAR(50),
IN measurement_value DECIMAL(20,10),
IN tool_id_var INT UNSIGNED,
IN config_id_var INT UNSIGNED,
IN run_id_var INT UNSIGNED
)
BEGIN
-- Variables representing FKs
DECLARE data_header_id INT UNSIGNED;
DECLARE tool_header_link_id INT UNSIGNED;
DECLARE tool_data_id INT UNSIGNED;
DECLARE tool_data_link_id INT UNSIGNED;
-- Add header
INSERT IGNORE INTO data_headers(DataHeader_Name)
VALUES(measurement_header);
SET data_header_id = (SELECT DataHeader_ID
FROM data_headers WHERE DataHeader_Name = measurement_header);
-- Link header to tool
INSERT IGNORE INTO tool_header_link(DataHeader_ID, Tool_ID)
VALUES(data_header_id, tool_id_var);
SET tool_header_link_id = (SELECT ToolHeaderLink_ID
FROM tool_header_link
WHERE DataHeader_ID = data_header_id AND Tool_ID = tool_id_var);
-- Add measurement
INSERT IGNORE INTO tool_data(Data_Value) VALUES(measurement_value);
SET tool_data_id = (SELECT ToolData_ID
FROM tool_data WHERE Data_Value = measurement_value);
-- Link measurement to header and configuration
INSERT IGNORE INTO
tool_data_link(ToolHeaderLink_ID, ToolData_ID, Run_ID)
VALUES(tool_header_link_id, tool_data_id, run_id_var);
SET tool_data_link_id = (SELECT ToolDataLink_ID FROM tool_data_link
WHERE ToolHeaderLink_ID = tool_header_link_id
AND ToolData_ID = tool_data_id AND Run_ID = run_id_var);
-- Link measurement to experiment configuration
INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
VALUES(tool_data_link_id, config_id_var);
END
Current Solution
I stumbled upon this solution to a similar issue. I enclosed the meat of the code inside a TRANSACTION
and immediately noticed a massive improvement in speed. Instead of the estimated completion time of about 36 hours, the actual completion time dropped to about 5 minutes! I also made a slight design change to the database and removed an unnecessary FK. If anyone sees further ways to improve this code, I am still interested. I have brought the performance into an acceptable range for our applications, but I am always interested in making things better.
To show the changes:
START TRANSACTION;
-- Loop through measurement table
OPEN m_cur;
m_loop: LOOP
FETCH m_cur INTO r_id, dh_name, d_value;
IF done THEN
CLOSE m_cur;
LEAVE m_loop;
END IF;
CALL add_measurement(dh_name, d_value, t_id, c_id, r_id);
END LOOP m_loop;
COMMIT;
Alternative Solution
Based on the answers below, I was able to update my solution to the one shown here. From my testing, this new solution appears to function as desired. It is also more than twice as fast as the previous solution: using this routine, I can add one million unique pieces of data in about 2.5 minutes!
Thank you all for your help!
CREATE PROCEDURE `add_table`(
IN config_id_var INT UNSIGNED
)
BEGIN
START TRANSACTION;
-- Add header
INSERT IGNORE INTO data_headers(DataHeader_Name)
SELECT DataHeader_Name
FROM `measurements`.`measurement_20131114_142402`;
-- Add measurement
INSERT IGNORE INTO tool_data(Data_Value)
SELECT Data_Value
FROM `measurements`.`measurement_20131114_142402`;
-- Link measurement to header and configuration
-- INSERT Non-Unique Values
INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
FROM `measurements`.`measurement_20131114_142402` AS m
JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
JOIN tool_data AS d ON d.Data_Value = m.Data_Value;
-- INSERT Unique Values
INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
FROM `measurements`.`measurement_20131114_142402` AS m
LEFT OUTER JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
LEFT OUTER JOIN tool_data AS d ON d.Data_Value = m.Data_Value
WHERE ((h.DataHeader_Name IS NULL) OR (d.Data_Value IS NULL));
-- Link measurement to experiment configuration
-- INSERT Non-Unique Values
INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
SELECT tdl.ToolDataLink_ID, config_id_var
FROM tool_data_link AS tdl
JOIN data_headers AS h ON h.DataHeader_ID = tdl.DataHeader_ID
JOIN tool_data AS d ON d.ToolData_ID = tdl.ToolData_ID;
-- INSERT Unique Values
INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
SELECT tdl.ToolDataLink_ID, config_id_var
FROM tool_data_link AS tdl
LEFT OUTER JOIN data_headers AS h ON h.DataHeader_ID = tdl.DataHeader_ID
LEFT OUTER JOIN tool_data AS d ON d.ToolData_ID = tdl.ToolData_ID
WHERE ((h.DataHeader_ID IS NULL) OR (d.ToolData_ID IS NULL));
COMMIT;
END
Conclusion
I did some more testing with the solution that did not use cursors. It is definitely faster initially; however, as the database grows, the execution time increases drastically.
I added a couple million data points to the database, then tried adding a small data set of a few hundred points. It took nearly 400x longer than the cursor solution. I believe that is because the cursors only touched the data points needed, whereas the joins had to scan through all of the data.
Based on those results, it appears that the cursor solution will be better for my applications.
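One hypothetical way to keep the set-based approach from scanning the whole database as it grows is to drive every insert from the small staging table rather than from the large link tables. The sketch below is untested and reuses the table and column names from the question; restricting the final `tool_link` insert through the staging table is my assumption, not part of the original routine:

```sql
-- Sketch: derive the tool_link rows from the staging table only,
-- so the join cost scales with the new batch, not the whole database.
INSERT IGNORE INTO tool_link(ToolDataLink_ID, Config_ID)
SELECT tdl.ToolDataLink_ID, 112  -- 112 = example config id for this batch
FROM `measurements`.`measurement_20131114_142402` AS m
JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
JOIN tool_data AS d ON d.Data_Value = m.Data_Value
JOIN tool_data_link AS tdl
  ON tdl.DataHeader_ID = h.DataHeader_ID
 AND tdl.ToolData_ID = d.ToolData_ID
 AND tdl.Run_ID = m.Run_ID;
```

With suitable indexes on the join columns, each lookup is then proportional to the size of the new measurement batch rather than the accumulated history.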
Upvotes: 4
Views: 381
Reputation:
Just as an example, take this statement from your question:
-- Link measurement to header and configuration
INSERT IGNORE INTO tool_data_link(DataHeader_ID, ToolData_ID, Run_ID)
SELECT t1.DataHeader_ID, t2.ToolData_ID, t3.Run_ID
FROM data_headers t1, tool_data t2, runs t3
WHERE (( t1.DataHeader_Name in (
SELECT DataHeader_Name
FROM `measurements`.`measurement_20131114_110059`)
) AND ( t2.Data_Value in (
SELECT Data_Value
FROM `measurements`.`measurement_20131114_110059`)
) AND ( t3.Run_ID in (
SELECT Run_ID
FROM `measurements`.`measurement_20131114_110059`)));
Compare the performance of just that SELECT
with the following:
SELECT h.DataHeader_ID, d.ToolData_ID, m.Run_ID
FROM measurements.measurement_20131114_110059 AS m
JOIN data_headers AS h ON h.DataHeader_Name = m.DataHeader_Name
JOIN tool_data AS d ON d.Data_Value = m.Data_Value
;
Note: You would absolutely need indexes on DataHeader_Name and Data_Value in both tables that have them. (Note: joining with a temporary table is pretty much always slow and can never be properly indexed)
The bottom line here is that all your data is coming out of measurement_*. Do as little as possible with other tables if you want performance.
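For instance, using the table and column names from the question, the supporting indexes might look like the following sketch (index names are illustrative; adjust prefix lengths and uniqueness to your schema):

```sql
-- Indexes to support the join predicates above.
-- Without them, each join degenerates into a full scan of the lookup table.
CREATE INDEX idx_dh_name  ON data_headers (DataHeader_Name);
CREATE INDEX idx_td_value ON tool_data (Data_Value);
```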
Upvotes: 1
Reputation: 131
Databases are designed for set-based logic. Try writing the same queries without cursors.
What your code does:
1. Reads the records from the measurement_xxxx table and, for each one, executes 5 inserts.
How this can be written with set logic:
1. Make the first insert for all the records from measurement_XXXX.
2. Make the second insert for all the records from measurement_XXXX.
.......
This will change the code to something like the following (I did not test the code, so it is a guideline):
CREATE PROCEDURE `using_set_logic`(
IN c_id INT UNSIGNED,
IN t_id INT UNSIGNED
)
begin
-- the first insert
insert into data_headers(DataHeader_Name)
select DataHeader_Name from `measurements`.`measurement_20131029_152902`;
-- the second insert if the DataHeader_Name is unique
insert into tool_header_link(DataHeader_ID, Tool_ID)
select DataHeader_ID, t_id from data_headers where DataHeader_Name in (
select DataHeader_Name from `measurements`.`measurement_20131029_152902`
);
-- the second insert if the DataHeader_Name is not unique:
-- take the last id for data_headers before the first insert,
-- take the last id for data_headers after the insert,
-- and use those values to bound the DataHeader_ID.
insert into tool_header_link(DataHeader_ID, Tool_ID)
select DataHeader_ID, t_id from data_headers
where DataHeader_ID between @beforeFirstInsert and @afterFirstInsert;
end
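The "last id before/after the insert" bounds mentioned above can be captured with user variables. A rough, untested sketch, again using the table names from the question:

```sql
-- Capture the id range produced by the first insert.
SET @beforeFirstInsert = (SELECT IFNULL(MAX(DataHeader_ID), 0) FROM data_headers);

INSERT INTO data_headers(DataHeader_Name)
SELECT DataHeader_Name FROM `measurements`.`measurement_20131029_152902`;

SET @afterFirstInsert = (SELECT MAX(DataHeader_ID) FROM data_headers);

-- Rows with DataHeader_ID above @beforeFirstInsert and up to
-- @afterFirstInsert are exactly the ones added by the insert above.
```

Note this only identifies the new rows reliably if no concurrent session inserts into data_headers between the two snapshots, which is another reason to wrap the whole routine in a transaction.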
Upvotes: 2