Reputation: 795
I have a file that has information that is primarily pipe delimited. But some fields are further delimited within. I guess you can call them nested delimited? This is how the rows in my file look.
FieldA|FieldB|[FieldC111~FieldC112~FieldC113][FieldC121~FieldC122~FieldC123]|[FieldD111~FieldD112~FieldD113][FieldD121~FieldD122~FieldD123]|FieldE|FieldF
FieldA|FieldB|[FieldC111~FieldC112~FieldC113][FieldC121~FieldC122~FieldC123][FieldC131~FieldC132~FieldC133]|[FieldD111~FieldD112~FieldD113][FieldD121~FieldD122~FieldD123]|FieldE|FieldF
FieldA|FieldB|[FieldC111~FieldC112~FieldC113][FieldC121~FieldC122~FieldC123][FieldC131~FieldC132~FieldC133]|[FieldD111~FieldD112~FieldD113][FieldD121~FieldD122~FieldD123][FieldD131~FieldD132~FieldD133]|FieldE|FieldF
FieldA|FieldB|[FieldC111~FieldC112~FieldC113][FieldC121~FieldC122~FieldC123]|[FieldD111~FieldD112~FieldD113][FieldD121~FieldD122~FieldD123][FieldD131~FieldD132~FieldD133]|FieldE|FieldF
Basically Field C and Field D, encapsulate X sets of fields within, where X ranges from 1-10. Each set is encapsulated by square [ ] brackets. The no. of fields within the square brackets remain the same (delimited by tilde ~).
My table in Oracle should look like this
FieldA|FieldB|FieldC111|FieldC112|FieldC113|FieldC121|FieldC122|FieldC123|FieldC131|FieldC132|FieldC133|FieldD111|FieldD112|FieldD113|FieldD121|FieldD122|FieldD123|FieldD131|FieldD132|FieldD133|
I'm not very strong at UNIX (basically google every single command) but I've come up with the following commands that help me delimit the file (ignoring the tilde problem till i sort out the encapsulated fields problem, cause I can use sql functions to read between the tilde symbols)
sed 's/\]\[/|/g' final.txt > final1.txt -- replaces all ][ with |
sed 's/\]|\[/|/g' final1.txt > final2.txt -- replaces all ]|[ with |
sed 's/|\[/|/g' final2.txt > final3.txt -- replaces all |[ with |
sed 's/\]|/|/g' final3.txt > final4.txt -- replaces all ]| with |
This gives me the file in a completely tab delimited format. However, the problem arises when I load the file, the number of delimiters varies (as the number of encapsulated fields in Field C and D varies) and hence data would just get loaded in a first come first serve basis.
Example, if I had 1 field encapsulated in Field C, and 3 Fields encapsulated in field D, my table would result in all 3 columns in Field C and one column in Field D containing data. What I want to happen is that if there is only 1 field encapsulated in Field C (but 3 fields provisioned for Field C) the other 2 fields of Field C should be blank, and the fields encapsulated in Field D should go to their respective D fields.
I have a few other solutions in mind, like splitting files, loading it into 3-4 different tables, then combining all 3 tables. But that would be a lot of extra work. Was wondering if there was a simpler way of doing this
I hope I've explained that clear enough. Its quite complex for me to put it into words :(
Some additional information about the data file I'm dealing with.
Upvotes: 0
Views: 3670
Reputation: 191275
If the file is on the database server and you can get it into a location that corresponds to an Oracle directory
object, you could load the initial pipe-delimited as an external table, something like:
create table my_external_table (
A varchar2(10),
B varchar2(10),
C varchar2(4000),
D varchar2(4000),
E varchar2(10),
F varchar2(10)
)
organization external (
type oracle_loader
default directory my_directory
access parameters (
records delimited by newline
fields terminated by '|'
)
location ('my_file.dat')
);
The brute-force the extraction of the nested-delimited data, in two steps, and insert that into your real table:
insert into my_table (a, b,
c0101, c0102, c0103, c0201, c0202, c0203, c0301, c0302, c0303,
d0101, d0102, d0103, d0201, d0202, d0203, d0301, d0302, d0303,
e, f)
with t as (
select a, b,
regexp_replace(regexp_substr(c, '\[.*?\]', 1, 1), '[][]') as c01,
regexp_replace(regexp_substr(c, '\[.*?\]', 1, 2), '[][]') as c02,
regexp_replace(regexp_substr(c, '\[.*?\]', 1, 3), '[][]') as c03,
regexp_replace(regexp_substr(d, '\[.*?\]', 1, 1), '[][]') as d01,
regexp_replace(regexp_substr(d, '\[.*?\]', 1, 2), '[][]') as d02,
regexp_replace(regexp_substr(d, '\[.*?\]', 1, 3), '[][]') as d03,
e, f
from my_external_table
)
select a, b,
regexp_substr(c01, '[^~]+', 1, 1) as c0101,
regexp_substr(c01, '[^~]+', 1, 2) as c0102,
regexp_substr(c01, '[^~]+', 1, 3) as c0103,
regexp_substr(c02, '[^~]+', 1, 1) as c0201,
regexp_substr(c02, '[^~]+', 1, 2) as c0202,
regexp_substr(c02, '[^~]+', 1, 3) as c0203,
regexp_substr(c03, '[^~]+', 1, 1) as c0301,
regexp_substr(c03, '[^~]+', 1, 2) as c0302,
regexp_substr(c03, '[^~]+', 1, 3) as c0303,
regexp_substr(d01, '[^~]+', 1, 1) as d0101,
regexp_substr(d01, '[^~]+', 1, 2) as d0102,
regexp_substr(d01, '[^~]+', 1, 3) as d0103,
regexp_substr(d02, '[^~]+', 1, 1) as d0201,
regexp_substr(d02, '[^~]+', 1, 2) as d0202,
regexp_substr(d02, '[^~]+', 1, 3) as d0203,
regexp_substr(d03, '[^~]+', 1, 1) as d0301,
regexp_substr(d03, '[^~]+', 1, 2) as d0302,
regexp_substr(d03, '[^~]+', 1, 3) as d0303,
e, f
from t;
The CTE splits out the separate C fields, and the main select then splits each of those out based on the tildes. I've only shown three C/D levels to same space, but you'd have to repeat the rows in both the CTE and main query. Which will make it a very long statement. And it involves a lot of cut-and-paste and risks introducing errors if the numbers aren't all just right; but you could generate the statement from a shell script (or as dynamic SQL) to avoid both issues.
This also assumes that the total length of the C/D fields on a particular line never exceeds 4k; if it could then that would make things a bit more complicated.
Anyway, for the data you showed, the final table would have (with apologies to those who quite rightly dislike scrolling):
select * from my_table;
A B C0101 C0102 C0103 C0201 C0202 C0203 C0301 C0302 C0303 D0101 D0102 D0103 D0201 D0202 D0203 D0301 D0302 D0303 E F
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ----------
FieldA FieldB FieldC111 FieldC112 FieldC113 FieldC121 FieldC122 FieldC123 FieldD111 FieldD112 FieldD113 FieldD121 FieldD122 FieldD123 FieldE FieldF
FieldA FieldB FieldC111 FieldC112 FieldC113 FieldC121 FieldC122 FieldC123 FieldC131 FieldC132 FieldC133 FieldD111 FieldD112 FieldD113 FieldD121 FieldD122 FieldD123 FieldE FieldF
FieldA FieldB FieldC111 FieldC112 FieldC113 FieldC121 FieldC122 FieldC123 FieldC131 FieldC132 FieldC133 FieldD111 FieldD112 FieldD113 FieldD121 FieldD122 FieldD123 FieldD131 FieldD132 FieldD133 FieldE FieldF
FieldA FieldB FieldC111 FieldC112 FieldC113 FieldC121 FieldC122 FieldC123 FieldD111 FieldD112 FieldD113 FieldD121 FieldD122 FieldD123 FieldD131 FieldD132 FieldD133 FieldE FieldF
If you wanted to generate the insert dynamically you could do this, which just recreates the manual version above - but for 10 sets of C/D fields:
declare
stmt varchar2(32767);
begin
stmt := 'insert into my_table (a, b, ';
for i in 1..10 loop -- 10 sets of C/D records
for j in 1..3 loop -- 3 tilde-delimited values for each
stmt := stmt || 'c' || lpad(i, 2, '0') || lpad(j, 2, '0') || ', ';
stmt := stmt || 'd' || lpad(i, 2, '0') || lpad(j, 2, '0') || ', ';
end loop;
end loop;
stmt := stmt || 'e, f) with t as (select a, b, ';
for i in 1..10 loop -- 10 sets of C/D records
stmt := stmt || 'regexp_replace(regexp_substr(c, ''\[.*?\]'', 1, ' || i
|| '), ''[][]'') as c' || lpad(i, 2, '0') || ', ';
stmt := stmt || 'regexp_replace(regexp_substr(d, ''\[.*?\]'', 1, ' || i
|| '), ''[][]'') as d' || lpad(i, 2, '0') || ', ';
end loop;
stmt := stmt || 'e, f from my_external_table) select a, b, ';
for i in 1..10 loop -- 10 sets of C/D records
for j in 1..3 loop -- 3 tilde-delimited values for each
stmt := stmt || 'regexp_substr(c' || lpad(i, 2, '0')
|| ', ''[^~]+'', 1, ' || j || '), ';
stmt := stmt || 'regexp_substr(d' || lpad(i, 2, '0')
|| ', ''[^~]+'', 1, ' || j || '), ';
end loop;
end loop;
stmt := stmt || 'e, f from t';
-- uncomment to see/debug the actual statement being executed
-- dbms_output.put_line(stmt);
execute immediate stmt;
end;
/
Running that creates the same records in the real table.
Upvotes: 2