Reputation: 1939
There is a un normalized table in my database with name details
structure and sample data as below(apologies for image, just thought it would be more comprehensible):
My challenge is to split columns - assignee, inventor and ipcsubclass
using delimitor |
into new tables {detail_inv
and inventors
}, {detail_asg
and assignees
} and {detail_ipc
and ipcsubclasses
}.
In all three cases, table schemas are similar. For example, columns on inventors table- id
and name
and on detail_inv table- detail_id
and inventor_id
. There must be only one name per row with all names unique in inventors table and ids to hold relationship in detail_inv table.
I tried stored procedure with below code for inventors- I made 3 procedures for 3 columns:(
drop procedure if exists normalise_details;
delimiter #
create procedure normalise_details()
proc_main:begin
declare v_cursor_done int unsigned default 0;
declare v_post_id int unsigned;
declare v_tags varchar(2048);
declare v_keyword varchar(50);
declare v_keyword_id mediumint unsigned;
declare v_tags_done int unsigned;
declare v_tags_idx int unsigned;
declare v_cursor cursor for select id, inventor from details order by id;
declare continue handler for not found set v_cursor_done = 1;
set autocommit = 0;
open v_cursor;
repeat
fetch v_cursor into v_post_id, v_tags;
set v_tags_done = 0;
set v_tags_idx = 1;
while not v_tags_done do
set v_keyword = substring(v_tags, v_tags_idx,
if(locate('|', v_tags, v_tags_idx) > 0,
locate('|', v_tags, v_tags_idx) - v_tags_idx,
length(v_tags)));
if length(v_keyword) > 0 then
set v_tags_idx = v_tags_idx + length(v_keyword) + 1;
set v_keyword = trim(v_keyword);
insert into inventors (name) values (v_keyword);
select id into v_keyword_id from inventors where name = v_keyword;
insert into details_inv (inventor_id, detail_id) values (v_keyword_id, v_post_id);
else
set v_tags_done = 1;
end if;
end while;
until v_cursor_done end repeat;
close v_cursor;
commit;
end proc_main #
delimiter ;
When I try this on some random test data, it works fine. when i do this on actual table, it doesn't work well. Only partial data is inserted. SQL throws no errors (except for some times: "#1172 - Result consisted of more than one row" or "inventor_id column can't be null")
I tried modifying the code at MySQL - Insert comma separated list into normalized tables via stored procedure to suit my needs but i failed.
Please help me, my DB table has become a mess and there are about 500,000 rows making it really difficult for me to explode and manage huge arrays on each project(recent project with ~200,000 rows).
Upvotes: 2
Views: 5387
Reputation: 28196
Looking at RolandoMySQLDBA's post to this dba.stackexchange question I feel confirmed in my initial reservations regarding triggered stored procedures. However, if you are certain that at any given time only a few lines are changed by user input then it should be possible to put together a fast acting procedure.
However, if there are many users working in parallel they might still lock each other out. I don't know whether this will really happen since the stored procedure will not change anything in the details
table. If necessary you could check out this page for ideas.
Edit: TRIGGER
I just extended the SQLfiddle of my previous post to this SQLfiddle with trigger, containing the following:
CREATE TRIGGER normdet AFTER INSERT ON detail FOR EACH ROW
BEGIN
DECLARE n int; DECLARE word VARCHAR(64)
;SET n=cntparts(NEW.inventor)
;WHILE n>0 DO
SET word=part(new.inventor,n)
;IF NOT EXISTS (SELECT * FROM inv WHERE invname=word) THEN
INSERT INTO inv (invname) VALUES (word)
;END IF
;INSERT INTO det2inv (didid,diiid)
SELECT NEW.id,invid FROM inv WHERE invname=word
;SET n=n-1
;END WHILE
-- and similar loops for assignee and cls ...
;END;
I also defined another function
CREATE FUNCTION cntparts (var varchar(1024)) RETURNS int
RETURN 1+LENGTH(var)-LENGTH(REPLACE(var,'|',''));
counting the words in a given varchar
. This can also be used to create loops instead of my fixed UNION
constructs for the base conversion in my first post.
The trigger now takes care of all new INSERT
s. A similar trigger still needs to be written to do the same for UPDATE
s. That should not be too hard to do ...
In my SQLfiddle I inserted another row into detail
after the trigger definition. The results are listed by two comparative SELECT statements, see fiddle.
reply to last comment:
Well, as I suggested in my original answer, you should first import all of the data (without having any triggers installed!!!!) and then plough through the detail
-table with the SELECT/UNION
statements. Before you do that you should find out the maximum number of words in each of the columns assignee
,inventor
and ipsubclass
by using
SELECT MAX(cntparts(inventor)) invcnt,
MAX(cntparts(assignee)) asscnt,
MAX(cntparts(ipsubclass)) clscnt
FROM detail
You can then adjust the number of SELECT/UNION
statments needed for each column. Then fill the link tables as shown in the SQLfiddle.
Maybe the whole process takes a while but you can safely work on one table after another (first the actual attribute table and then the associated link-table).
After that you can activate your trigger which should then only work on individually added lines.
Upvotes: 3
Reputation: 28196
First of all it seems to me you should divide your table into four separate ones:
detail
(main table, containing: id, projectid,
publicationnumber, prioritycountry, prioritydate
and status
) inv
(inventor table, containing: invid, invname
and possibly more
inventor-related data) cls
(ipsubclass table, containing: clsid,
clsname
and possibly a description of each class)assignee
(containing data of assignee companies like: assid, assname ...
)Since there will be n:m
relationships between the main table and the inv, cls
and assignee
you should also set up link tables holding the relationships like
The restructuring task can be broken down into several steps:
First of all you need to apply a user defined function for splitting the combined values. You can do that with a function as described here
I simplifed it a bit further, since in your example we are only confronted with a single separator |
:
CREATE FUNCTION part( x VARCHAR(255), pos INT)
RETURNS VARCHAR(255) BEGIN
DECLARE delim char(1)
;SET delim='|'
;RETURN TRIM(REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
delim, ''))
;END;
(Notice the TRIM
function to chop away any unwanted blanks ...)
Next, you should define the target tables holding your inventors and possible ipsubclasses (... and assignees, which I have not done yet):
CREATE TABLE inv (invid int auto_increment PRIMARY KEY, invname nvarchar(64));
CREATE TABLE cls (clsid int auto_increment PRIMARY KEY, clsname nvarchar(6));
Feel free to extend the tables with additional columns, just as you need them.
Now we fill the tables with unique values. First the inventors in table inv
:
INSERT INTO inv (invname)
SELECT inv FROM (
SELECT part(inventor,1) inv from detail
UNION
SELECT part(inventor,2) from detail
UNION
SELECT part(inventor,3) from detail
UNION
SELECT part(inventor,4) from detail
UNION
SELECT part(inventor,5) from detail
UNION
SELECT part(inventor,6) from detail
UNION
SELECT part(inventor,7) from detail
UNION
SELECT part(inventor,8) from detail
) t WHERE inv>'' ORDER BY inv;
next the ipsubclasses:
INSERT INTO cls (clsname)
SELECT icls FROM (
SELECT part(iclass,1) icls from detail
UNION
SELECT part(iclass,2) from detail
UNION
SELECT part(iclass,3) from detail
UNION
SELECT part(iclass,4) from detail
UNION
SELECT part(iclass,5) from detail
UNION
SELECT part(iclass,6) from detail
UNION
SELECT part(iclass,7) from detail
UNION
SELECT part(iclass,8) from detail
) t WHERE icls>'' ORDER BY icls;
In my example I only looked at the first 8 entries of each field. This can be modified to suit your needs of course. You will end up with two uniquely numbered tables containing all possible inventors and all possible ipsubclasses (and in an analogous way all assignees). You can have a look at my SQLfiddle here: http://sqlfiddle.com/#!2/aeafe/1
The remaining task is now to fill the link tables with suitable keys (pairs of ids from main table details
and its attribute tables inv, cls
and assignee
.
Edit
The link tables are being filled with the following statements:
INSERT INTO det2inv (didid,diiid)
SELECT id,invid FROM inv
INNER JOIN detail ON INSTR(inventor,invname)>0;
INSERT INTO det2cls (dcdid,dccid)
SELECT id,clsid FROM cls
INNER JOIN detail ON INSTR(iclass,clsname)>0;
-- ... and a similar one for det2ass
The INSTR()
function will not work perfectly as names like Hagen, Pete
will be matched successfully against Gleichenhagen, Peter
. To avoid these cases the comparison should be modifed as shown here:
...
INNER JOIN detail ON INSTR(REPLACE(CONCAT('|',inventor,'|'),' ',''),
REPLACE(CONCAT('|',invname,'|'),' ',''))>0;
You can see the complete working example here: http://sqlfiddle.com/#!2/097be/8
Upvotes: 2