Reputation: 1276
I have an Oracle stored procedure where we have to parse an oddly formatted flat file line by line, due to the fact that it has varying amounts of columns that are dictated by different header structures. So, some example data from the file might look like this (and yes, those headers are inline in the file):
ID TYPE DATE VALUE1 VALUE2
1 A 3-1-2021 12:00:00 10 24
1 A 3-2-2021 12:00:00 25 25
1 A 3-3-2021 12:00:00 31 36
ID TYPE DATE VALUE1 UNNECESSARY_VALUE VALUE2
2 B 3-1-2021 12:00:00 10 24 14
2 B 3-2-2021 12:00:00 25 25 22
2 B 3-3-2021 12:00:00 31 36 12
ID TYPE DATE VALUE1 VALUE2
3 A 3-1-2021 12:00:00 10 24
3 A 3-2-2021 12:00:00 25 25
3 A 3-3-2021 12:00:00 31 36
ID TYPE DATE VALUE1 UNNECESSARY_VALUE1 UNNECESSARY_VALUE2 VALUE2
4 C 3-1-2021 12:00:00 10 21 22 38
4 C 3-2-2021 12:00:00 25 21 42 23
4 C 3-3-2021 12:00:00 31 22 32 34
We are getting this from an external entity that has simply refused to make this an easier file to ingest. What I'm doing right now is getting a line, determining if it's a header or not and tracking what is present in each line with some variables. The TYPE column will usually also coincide with the number of columns, telling us which ones we need to get. I'm also clearing the white space and making the line comma separated, as the amount of space isn't necessarily constant.
I'm using REGEXP_SUBSTR to separate each line and save several variables. Consequently, that requires several calls to REGEXP_SUBSTR. So, I have several calls to get different values like so:
--v_val_index and v_val2_index are based on header row positioning
v_id := REGEXP_SUBSTR(v_line, '(.*?,){'||(0)||'}([^,]*)', 1, 1, '', 2);
v_type := REGEXP_SUBSTR(v_line, '(.*?,){'||(1)||'}([^,]*)', 1, 1, '', 2);
v_date := REGEXP_SUBSTR(v_line, '(.*?,){'||(2)||'}([^,]*)', 1, 1, '', 2);
v_val1 := REGEXP_SUBSTR(v_line, '(.*?,){'||(v_val1_index)||'}([^,]*)', 1, 1, '', 2);
v_val2 := REGEXP_SUBSTR(v_line, '(.*?,){'||(v_val2_index)||'}([^,]*)', 1, 1, '', 2);
The function parses a 100 MB file in about 30 seconds, so execution time isn't terrible but I want to verify if there is room for improvement. Therefore, my question is two part. Instead of using REGEXP_SUBSTR, is there a way to break up the line into a custom type and then reference the pieces of the line by index? If so, will that make a noticeable performance improvement?
Upvotes: 0
Views: 33
Reputation: 167962
You can loop through each line and insert it into a table without converting the current format to a CSV format:
DECLARE
v_data CLOB;
v_line_start PLS_INTEGER := 1;
v_line_end PLS_INTEGER;
v_line VARCHAR2(32000);
c_pattern CONSTANT VARCHAR2(100) :=
'^(\d+)\s+(\S+)\s+(\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}:\d{2})'
|| '\s+(\d+).*?(\d+)$';
BEGIN
-- Read in your file.
v_data := 'ID TYPE DATE VALUE1 VALUE2
1 A 3-1-2021 12:00:00 10 24
1 A 3-2-2021 12:00:00 25 25
1 A 3-3-2021 12:00:00 31 36
ID TYPE DATE VALUE1 UNNECESSARY_VALUE VALUE2
2 B 3-1-2021 12:00:00 10 24 14
2 B 3-2-2021 12:00:00 25 25 22
2 B 3-3-2021 12:00:00 31 36 12
ID TYPE DATE VALUE1 VALUE2
3 A 3-1-2021 12:00:00 10 24
3 A 3-2-2021 12:00:00 25 25
3 A 3-3-2021 12:00:00 31 36
ID TYPE DATE VALUE1 UNNECESSARY_VALUE1 UNNECESSARY_VALUE2 VALUE2
4 C 3-1-2021 12:00:00 10 21 22 38
4 C 3-2-2021 12:00:00 25 21 42 23
4 C 3-3-2021 12:00:00 31 22 32 34';
WHILE v_line_start > 0 LOOP
v_line_end := INSTR( v_data, CHR(10), v_line_start );
IF v_line_end > 0 THEN
v_line := DBMS_LOB.SUBSTR( v_data, v_line_end - v_line_start, v_line_start );
v_line_start := v_line_end + 1;
ELSE
v_line := DBMS_LOB.SUBSTR( v_data, LENGTH( v_data ) - v_line_start + 1, v_line_start );
v_line_start := 0;
END IF;
IF v_line LIKE 'ID TYPE DATE %' THEN
CONTINUE;
END IF;
INSERT INTO data (
id,
typ,
dt,
value1,
value2
) VALUES (
TO_NUMBER( REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 1 ) ),
REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 2 ),
TO_DATE(
REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 3 ),
'MM-DD-YYYY HH24:MI:SS'
),
TO_NUMBER( REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 4 ) ),
TO_NUMBER( REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 5 ) )
);
END LOOP;
END;
/
Which, given the table:
CREATE TABLE data (
id NUMBER,
typ VARCHAR2(10),
dt DATE,
value1 NUMBER,
value2 NUMBER
);
Then:
SELECT *
FROM data;
Outputs:
ID | TYP | DT | VALUE1 | VALUE2 -: | :-- | :------------------ | -----: | -----: 1 | A | 2021-03-01 12:00:00 | 10 | 24 1 | A | 2021-03-02 12:00:00 | 25 | 25 1 | A | 2021-03-03 12:00:00 | 31 | 36 2 | B | 2021-03-01 12:00:00 | 10 | 14 2 | B | 2021-03-02 12:00:00 | 25 | 22 2 | B | 2021-03-03 12:00:00 | 31 | 12 3 | A | 2021-03-01 12:00:00 | 10 | 24 3 | A | 2021-03-02 12:00:00 | 25 | 25 3 | A | 2021-03-03 12:00:00 | 31 | 36 4 | C | 2021-03-01 12:00:00 | 10 | 38 4 | C | 2021-03-02 12:00:00 | 25 | 23 4 | C | 2021-03-03 12:00:00 | 31 | 34
db<>fiddle here
Or, just using SUBSTR
:
DECLARE
v_data CLOB;
v_line_start PLS_INTEGER := 1;
v_line_end PLS_INTEGER;
v_line VARCHAR2(32000);
v_v2_pos PLS_INTEGER;
BEGIN
-- Read in your file.
v_data := 'ID TYPE DATE VALUE1 VALUE2
1 A 3-1-2021 12:00:00 10 24
1 A 3-2-2021 12:00:00 25 25
1 A 3-3-2021 12:00:00 31 36
ID TYPE DATE VALUE1 UNNECESSARY_VALUE VALUE2
2 B 3-1-2021 12:00:00 10 24 14
2 B 3-2-2021 12:00:00 25 25 22
2 B 3-3-2021 12:00:00 31 36 12
ID TYPE DATE VALUE1 VALUE2
3 A 3-1-2021 12:00:00 10 24
3 A 3-2-2021 12:00:00 25 25
3 A 3-3-2021 12:00:00 31 36
ID TYPE DATE VALUE1 UNNECESSARY_VALUE1 UNNECESSARY_VALUE2 VALUE2
4 C 3-1-2021 12:00:00 10 21 22 38
4 C 3-2-2021 12:00:00 25 21 42 23
4 C 3-3-2021 12:00:00 31 22 32 34';
WHILE v_line_start > 0 LOOP
v_line_end := INSTR( v_data, CHR(10), v_line_start );
IF v_line_end > 0 THEN
v_line := DBMS_LOB.SUBSTR( v_data, v_line_end - v_line_start, v_line_start );
v_line_start := v_line_end + 1;
ELSE
v_line := DBMS_LOB.SUBSTR( v_data, LENGTH( v_data ) - v_line_start + 1, v_line_start );
v_line_start := 0;
END IF;
IF v_line LIKE 'ID TYPE DATE %' THEN
v_v2_pos := INSTR( v_line, ' ', -1 ) + 1;
CONTINUE;
END IF;
INSERT INTO data (
id,
typ,
dt,
value1,
value2
) VALUES (
TO_NUMBER( SUBSTR( v_line, 1, 4 ) ),
RTRIM( SUBSTR( v_line, 5, 5 ) ),
TO_DATE( SUBSTR( v_line, 10, 18 ), 'MM-DD-YYYY HH24:MI:SS' ),
TO_NUMBER( SUBSTR( v_line, 30, 9 ) ),
TO_NUMBER( SUBSTR( v_line, v_v2_pos ) )
);
END LOOP;
END;
/
db<>fiddle here
Upvotes: 1