Joshua1729
Joshua1729

Reputation: 795

How to load delimited file (multiple delimiters) into Oracle?

I have a file that has information that is primarily pipe delimited. But some fields are further delimited within. I guess you can call them nested delimited? This is how the rows in my file look.

FieldA|FieldB|[FieldC111~FieldC112~FieldC113][FieldC121~FieldC122~FieldC123]|[FieldD111~FieldD112~FieldD113][FieldD121~FieldD122~FieldD123]|FieldE|FieldF
FieldA|FieldB|[FieldC111~FieldC112~FieldC113][FieldC121~FieldC122~FieldC123][FieldC131~FieldC132~FieldC133]|[FieldD111~FieldD112~FieldD113][FieldD121~FieldD122~FieldD123]|FieldE|FieldF
FieldA|FieldB|[FieldC111~FieldC112~FieldC113][FieldC121~FieldC122~FieldC123][FieldC131~FieldC132~FieldC133]|[FieldD111~FieldD112~FieldD113][FieldD121~FieldD122~FieldD123][FieldD131~FieldD132~FieldD133]|FieldE|FieldF
FieldA|FieldB|[FieldC111~FieldC112~FieldC113][FieldC121~FieldC122~FieldC123]|[FieldD111~FieldD112~FieldD113][FieldD121~FieldD122~FieldD123][FieldD131~FieldD132~FieldD133]|FieldE|FieldF

Basically Field C and Field D, encapsulate X sets of fields within, where X ranges from 1-10. Each set is encapsulated by square [ ] brackets. The no. of fields within the square brackets remain the same (delimited by tilde ~).

My table in Oracle should look like this

FieldA|FieldB|FieldC111|FieldC112|FieldC113|FieldC121|FieldC122|FieldC123|FieldC131|FieldC132|FieldC133|FieldD111|FieldD112|FieldD113|FieldD121|FieldD122|FieldD123|FieldD131|FieldD132|FieldD133|

I'm not very strong at UNIX (basically google every single command) but I've come up with the following commands that help me delimit the file (ignoring the tilde problem till i sort out the encapsulated fields problem, cause I can use sql functions to read between the tilde symbols)

sed 's/\]\[/|/g' final.txt > final1.txt -- replaces all ][ with |
sed 's/\]|\[/|/g' final1.txt > final2.txt -- replaces all ]|[ with |
sed 's/|\[/|/g' final2.txt > final3.txt -- replaces all |[ with |
sed 's/\]|/|/g' final3.txt > final4.txt -- replaces all ]| with |

This gives me the file in a completely tab delimited format. However, the problem arises when I load the file, the number of delimiters varies (as the number of encapsulated fields in Field C and D varies) and hence data would just get loaded in a first come first serve basis.

Example, if I had 1 field encapsulated in Field C, and 3 Fields encapsulated in field D, my table would result in all 3 columns in Field C and one column in Field D containing data. What I want to happen is that if there is only 1 field encapsulated in Field C (but 3 fields provisioned for Field C) the other 2 fields of Field C should be blank, and the fields encapsulated in Field D should go to their respective D fields.

I have a few other solutions in mind, like splitting files, loading it into 3-4 different tables, then combining all 3 tables. But that would be a lot of extra work. Was wondering if there was a simpler way of doing this

I hope I've explained that clear enough. Its quite complex for me to put it into words :(

Some additional information about the data file I'm dealing with.

Upvotes: 0

Views: 3670

Answers (1)

Alex Poole
Alex Poole

Reputation: 191275

If the file is on the database server and you can get it into a location that corresponds to an Oracle directory object, you could load the initial pipe-delimited as an external table, something like:

create table my_external_table (
    A varchar2(10),
    B varchar2(10),
    C varchar2(4000),
    D varchar2(4000),
    E varchar2(10),
    F varchar2(10)
)
organization external (
  type oracle_loader
  default directory my_directory
  access parameters (
    records delimited by newline
    fields terminated by '|'
  )
  location ('my_file.dat')
);

The brute-force the extraction of the nested-delimited data, in two steps, and insert that into your real table:

insert into my_table (a, b,
  c0101, c0102, c0103, c0201, c0202, c0203, c0301, c0302, c0303,
  d0101, d0102, d0103, d0201, d0202, d0203, d0301, d0302, d0303,
  e, f)
with t as (
  select a, b,
    regexp_replace(regexp_substr(c, '\[.*?\]', 1, 1), '[][]') as c01,
    regexp_replace(regexp_substr(c, '\[.*?\]', 1, 2), '[][]') as c02,
    regexp_replace(regexp_substr(c, '\[.*?\]', 1, 3), '[][]') as c03,
    regexp_replace(regexp_substr(d, '\[.*?\]', 1, 1), '[][]') as d01,
    regexp_replace(regexp_substr(d, '\[.*?\]', 1, 2), '[][]') as d02,
    regexp_replace(regexp_substr(d, '\[.*?\]', 1, 3), '[][]') as d03,
    e, f
  from my_external_table
)
select a, b,
  regexp_substr(c01, '[^~]+', 1, 1) as c0101,
  regexp_substr(c01, '[^~]+', 1, 2) as c0102,
  regexp_substr(c01, '[^~]+', 1, 3) as c0103,
  regexp_substr(c02, '[^~]+', 1, 1) as c0201,
  regexp_substr(c02, '[^~]+', 1, 2) as c0202,
  regexp_substr(c02, '[^~]+', 1, 3) as c0203,
  regexp_substr(c03, '[^~]+', 1, 1) as c0301,
  regexp_substr(c03, '[^~]+', 1, 2) as c0302,
  regexp_substr(c03, '[^~]+', 1, 3) as c0303,
  regexp_substr(d01, '[^~]+', 1, 1) as d0101,
  regexp_substr(d01, '[^~]+', 1, 2) as d0102,
  regexp_substr(d01, '[^~]+', 1, 3) as d0103,
  regexp_substr(d02, '[^~]+', 1, 1) as d0201,
  regexp_substr(d02, '[^~]+', 1, 2) as d0202,
  regexp_substr(d02, '[^~]+', 1, 3) as d0203,
  regexp_substr(d03, '[^~]+', 1, 1) as d0301,
  regexp_substr(d03, '[^~]+', 1, 2) as d0302,
  regexp_substr(d03, '[^~]+', 1, 3) as d0303,
  e, f
from t;

The CTE splits out the separate C fields, and the main select then splits each of those out based on the tildes. I've only shown three C/D levels to same space, but you'd have to repeat the rows in both the CTE and main query. Which will make it a very long statement. And it involves a lot of cut-and-paste and risks introducing errors if the numbers aren't all just right; but you could generate the statement from a shell script (or as dynamic SQL) to avoid both issues.

This also assumes that the total length of the C/D fields on a particular line never exceeds 4k; if it could then that would make things a bit more complicated.

Anyway, for the data you showed, the final table would have (with apologies to those who quite rightly dislike scrolling):

select * from my_table;

A          B          C0101      C0102      C0103      C0201      C0202      C0203      C0301      C0302      C0303      D0101      D0102      D0103      D0201      D0202      D0203      D0301      D0302      D0303      E          F        
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
FieldA     FieldB     FieldC111  FieldC112  FieldC113  FieldC121  FieldC122  FieldC123                                   FieldD111  FieldD112  FieldD113  FieldD121  FieldD122  FieldD123                                   FieldE     FieldF     
FieldA     FieldB     FieldC111  FieldC112  FieldC113  FieldC121  FieldC122  FieldC123  FieldC131  FieldC132  FieldC133  FieldD111  FieldD112  FieldD113  FieldD121  FieldD122  FieldD123                                   FieldE     FieldF     
FieldA     FieldB     FieldC111  FieldC112  FieldC113  FieldC121  FieldC122  FieldC123  FieldC131  FieldC132  FieldC133  FieldD111  FieldD112  FieldD113  FieldD121  FieldD122  FieldD123  FieldD131  FieldD132  FieldD133  FieldE     FieldF     
FieldA     FieldB     FieldC111  FieldC112  FieldC113  FieldC121  FieldC122  FieldC123                                   FieldD111  FieldD112  FieldD113  FieldD121  FieldD122  FieldD123  FieldD131  FieldD132  FieldD133  FieldE     FieldF     

If you wanted to generate the insert dynamically you could do this, which just recreates the manual version above - but for 10 sets of C/D fields:

declare
  stmt varchar2(32767);
begin
  stmt := 'insert into my_table (a, b, ';
  for i in 1..10 loop  -- 10 sets of C/D records
    for j in 1..3 loop  -- 3 tilde-delimited values for each
      stmt := stmt || 'c' || lpad(i, 2, '0') || lpad(j, 2, '0') || ', ';
      stmt := stmt || 'd' || lpad(i, 2, '0') || lpad(j, 2, '0') || ', ';
    end loop;
  end loop;
  stmt := stmt || 'e, f) with t as (select a, b, ';
  for i in 1..10 loop  -- 10 sets of C/D records
    stmt := stmt || 'regexp_replace(regexp_substr(c, ''\[.*?\]'', 1, ' || i
      || '), ''[][]'') as c' || lpad(i, 2, '0') || ', ';
    stmt := stmt || 'regexp_replace(regexp_substr(d, ''\[.*?\]'', 1, ' || i
      || '), ''[][]'') as d' || lpad(i, 2, '0') || ', ';
  end loop;
  stmt := stmt || 'e, f from my_external_table) select a, b, ';
  for i in 1..10 loop  -- 10 sets of C/D records
    for j in 1..3 loop  -- 3 tilde-delimited values for each
      stmt := stmt || 'regexp_substr(c' || lpad(i, 2, '0')
        || ', ''[^~]+'', 1, ' || j || '), ';
      stmt := stmt || 'regexp_substr(d' || lpad(i, 2, '0')
        || ', ''[^~]+'', 1, ' || j || '), ';
    end loop;
  end loop;
  stmt := stmt || 'e, f from t';

  -- uncomment to see/debug the actual statement being executed
  -- dbms_output.put_line(stmt);
  execute immediate stmt;
end;
/

Running that creates the same records in the real table.

Upvotes: 2

Related Questions