Fr0zenFyr
Fr0zenFyr

Reputation: 1939

MySQL: stored procedure to split column(with delimiters) and insert into new table

There is a un normalized table in my database with name details structure and sample data as below(apologies for image, just thought it would be more comprehensible): **Schema**

My challenge is to split columns - assignee, inventor and ipcsubclass using delimitor | into new tables {detail_inv and inventors}, {detail_asg and assignees} and {detail_ipc and ipcsubclasses}.

In all three cases, table schemas are similar. For example, columns on inventors table- id and name and on detail_inv table- detail_id and inventor_id. There must be only one name per row with all names unique in inventors table and ids to hold relationship in detail_inv table.

I tried stored procedure with below code for inventors- I made 3 procedures for 3 columns:(

drop procedure if exists normalise_details;

delimiter #

create procedure normalise_details()
proc_main:begin

declare v_cursor_done int unsigned default 0;
declare v_post_id int unsigned;
declare v_tags varchar(2048);
declare v_keyword varchar(50);

declare v_keyword_id mediumint unsigned;

declare v_tags_done int unsigned;
declare v_tags_idx int unsigned;

declare v_cursor cursor for select id, inventor from details order by id;
declare continue handler for not found set v_cursor_done = 1;

set autocommit = 0; 

open v_cursor;
repeat

  fetch v_cursor into v_post_id, v_tags;
  set v_tags_done = 0;       
  set v_tags_idx = 1;

  while not v_tags_done do

    set v_keyword = substring(v_tags, v_tags_idx, 
      if(locate('|', v_tags, v_tags_idx) > 0, 
        locate('|', v_tags, v_tags_idx) - v_tags_idx, 
        length(v_tags)));

      if length(v_keyword) > 0 then

        set v_tags_idx = v_tags_idx + length(v_keyword) + 1;

        set v_keyword = trim(v_keyword);
        insert into inventors (name) values (v_keyword);

        select id into v_keyword_id from inventors where name = v_keyword;
        insert into details_inv (inventor_id, detail_id) values (v_keyword_id, v_post_id);

      else
        set v_tags_done = 1;
      end if;

  end while;

until v_cursor_done end repeat;

close v_cursor;

commit;

end proc_main #


delimiter ;

When I try this on some random test data, it works fine. when i do this on actual table, it doesn't work well. Only partial data is inserted. SQL throws no errors (except for some times: "#1172 - Result consisted of more than one row" or "inventor_id column can't be null")

I tried modifying the code at MySQL - Insert comma separated list into normalized tables via stored procedure to suit my needs but i failed.

Please help me, my DB table has become a mess and there are about 500,000 rows making it really difficult for me to explode and manage huge arrays on each project(recent project with ~200,000 rows).

Upvotes: 2

Views: 5387

Answers (2)

Carsten Massmann
Carsten Massmann

Reputation: 28196

Looking at RolandoMySQLDBA's post to this dba.stackexchange question I feel confirmed in my initial reservations regarding triggered stored procedures. However, if you are certain that at any given time only a few lines are changed by user input then it should be possible to put together a fast acting procedure.

However, if there are many users working in parallel they might still lock each other out. I don't know whether this will really happen since the stored procedure will not change anything in the details table. If necessary you could check out this page for ideas.

Edit: TRIGGER

I just extended the SQLfiddle of my previous post to this SQLfiddle with trigger, containing the following:

CREATE TRIGGER normdet AFTER INSERT ON detail FOR EACH ROW
BEGIN
  DECLARE n int; DECLARE word VARCHAR(64)

 ;SET n=cntparts(NEW.inventor)
 ;WHILE n>0 DO
   SET word=part(new.inventor,n)
   ;IF NOT EXISTS (SELECT * FROM inv WHERE invname=word) THEN
     INSERT INTO inv (invname) VALUES (word)
   ;END IF
   ;INSERT INTO det2inv (didid,diiid) 
    SELECT NEW.id,invid FROM inv WHERE invname=word
   ;SET n=n-1
 ;END WHILE
  -- and similar loops for assignee and cls ...
;END;

I also defined another function

CREATE FUNCTION cntparts (var varchar(1024)) RETURNS int
RETURN 1+LENGTH(var)-LENGTH(REPLACE(var,'|',''));

counting the words in a given varchar. This can also be used to create loops instead of my fixed UNION constructs for the base conversion in my first post.

The trigger now takes care of all new INSERTs. A similar trigger still needs to be written to do the same for UPDATEs. That should not be too hard to do ...

In my SQLfiddle I inserted another row into detail after the trigger definition. The results are listed by two comparative SELECT statements, see fiddle.

reply to last comment:

Well, as I suggested in my original answer, you should first import all of the data (without having any triggers installed!!!!) and then plough through the detail-table with the SELECT/UNION statements. Before you do that you should find out the maximum number of words in each of the columns assignee,inventor and ipsubclass by using

SELECT MAX(cntparts(inventor)) invcnt,
       MAX(cntparts(assignee)) asscnt,
       MAX(cntparts(ipsubclass)) clscnt 
FROM detail

You can then adjust the number of SELECT/UNION statments needed for each column. Then fill the link tables as shown in the SQLfiddle.

Maybe the whole process takes a while but you can safely work on one table after another (first the actual attribute table and then the associated link-table).

After that you can activate your trigger which should then only work on individually added lines.

Upvotes: 3

Carsten Massmann
Carsten Massmann

Reputation: 28196

First of all it seems to me you should divide your table into four separate ones:

  1. detail (main table, containing: id, projectid, publicationnumber, prioritycountry, prioritydate and status)
  2. inv (inventor table, containing: invid, invname and possibly more inventor-related data)
  3. cls (ipsubclass table, containing: clsid, clsname and possibly a description of each class)
  4. assignee (containing data of assignee companies like: assid, assname ...)

Since there will be n:m relationships between the main table and the inv, cls and assignee you should also set up link tables holding the relationships like

  • det2inv
  • det2cls
  • det2ass

The restructuring task can be broken down into several steps:

First of all you need to apply a user defined function for splitting the combined values. You can do that with a function as described here

I simplifed it a bit further, since in your example we are only confronted with a single separator |:

CREATE FUNCTION part( x VARCHAR(255), pos INT) 
RETURNS VARCHAR(255) BEGIN
 DECLARE delim char(1)
 ;SET delim='|'
 ;RETURN TRIM(REPLACE(SUBSTRING(SUBSTRING_INDEX(x, delim, pos),
                   LENGTH(SUBSTRING_INDEX(x, delim, pos -1)) + 1),
       delim, ''))
 ;END;

(Notice the TRIM function to chop away any unwanted blanks ...)

Next, you should define the target tables holding your inventors and possible ipsubclasses (... and assignees, which I have not done yet):

CREATE TABLE inv (invid int auto_increment PRIMARY KEY, invname nvarchar(64));
CREATE TABLE cls (clsid int auto_increment PRIMARY KEY, clsname nvarchar(6));

Feel free to extend the tables with additional columns, just as you need them.

Now we fill the tables with unique values. First the inventors in table inv:

INSERT INTO inv (invname) 
SELECT inv FROM (
 SELECT part(inventor,1) inv from detail
 UNION 
 SELECT part(inventor,2) from detail
 UNION 
 SELECT part(inventor,3) from detail
 UNION 
 SELECT part(inventor,4) from detail
 UNION 
 SELECT part(inventor,5) from detail
 UNION 
 SELECT part(inventor,6) from detail
 UNION 
 SELECT part(inventor,7) from detail
 UNION 
 SELECT part(inventor,8) from detail
) t WHERE inv>'' ORDER BY inv;

next the ipsubclasses:

INSERT INTO cls (clsname)
SELECT icls FROM (
 SELECT part(iclass,1) icls from detail
 UNION 
 SELECT part(iclass,2) from detail
 UNION 
 SELECT part(iclass,3) from detail
 UNION 
 SELECT part(iclass,4) from detail
 UNION 
 SELECT part(iclass,5) from detail
 UNION 
 SELECT part(iclass,6) from detail
 UNION 
 SELECT part(iclass,7) from detail
 UNION 
 SELECT part(iclass,8) from detail
) t WHERE icls>'' ORDER BY icls;

In my example I only looked at the first 8 entries of each field. This can be modified to suit your needs of course. You will end up with two uniquely numbered tables containing all possible inventors and all possible ipsubclasses (and in an analogous way all assignees). You can have a look at my SQLfiddle here: http://sqlfiddle.com/#!2/aeafe/1

The remaining task is now to fill the link tables with suitable keys (pairs of ids from main table details and its attribute tables inv, clsand assignee.

Edit

The link tables are being filled with the following statements:

INSERT INTO det2inv (didid,diiid)
SELECT id,invid FROM inv 
INNER JOIN detail ON INSTR(inventor,invname)>0;

INSERT INTO det2cls (dcdid,dccid)
SELECT id,clsid FROM cls 
INNER JOIN detail ON INSTR(iclass,clsname)>0;

-- ... and a similar one for det2ass

The INSTR() function will not work perfectly as names like Hagen, Pete will be matched successfully against Gleichenhagen, Peter. To avoid these cases the comparison should be modifed as shown here:

...
INNER JOIN detail ON INSTR(REPLACE(CONCAT('|',inventor,'|'),' ',''),
                           REPLACE(CONCAT('|',invname,'|'),' ',''))>0;

You can see the complete working example here: http://sqlfiddle.com/#!2/097be/8

Upvotes: 2

Related Questions