SausageBuscuit
SausageBuscuit

Reputation: 1276

Oracle selecting all values from a CSV line into custom type

I have an Oracle stored procedure where we have to parse an oddly formatted flat file line by line, due to the fact that it has varying amounts of columns that are dictated by different header structures. So, some example data from the file might look like this (and yes, those headers are inline in the file):

ID  TYPE DATE                VALUE1   VALUE2
1   A    3-1-2021 12:00:00   10       24
1   A    3-2-2021 12:00:00   25       25
1   A    3-3-2021 12:00:00   31       36
ID  TYPE DATE                VALUE1   UNNECESSARY_VALUE VALUE2
2   B    3-1-2021 12:00:00   10       24                14
2   B    3-2-2021 12:00:00   25       25                22
2   B    3-3-2021 12:00:00   31       36                12
ID  TYPE DATE                VALUE1   VALUE2
3   A    3-1-2021 12:00:00   10       24
3   A    3-2-2021 12:00:00   25       25
3   A    3-3-2021 12:00:00   31       36
ID  TYPE DATE                VALUE1   UNNECESSARY_VALUE1 UNNECESSARY_VALUE2 VALUE2
4   C    3-1-2021 12:00:00   10       21                 22                 38
4   C    3-2-2021 12:00:00   25       21                 42                 23
4   C    3-3-2021 12:00:00   31       22                 32                 34

We are getting this from an external entity that has simply refused to make this an easier file to ingest. What I'm doing right now is getting a line, determining if it's a header or not and tracking what is present in each line with some variables. The TYPE column will usually also coincide with the number of columns, telling us which ones we need to get. I'm also clearing the white space and making the line comma separated, as the amount of space isn't necessarily constant.

I'm using REGEXP_SUBSTR to separate each line and save several variables. Consequently, that requires several calls to REGEXP_SUBSTR. So, I have several calls to get different values like so:

--v_val_index and v_val2_index are based on header row positioning
v_id   := REGEXP_SUBSTR(v_line, '(.*?,){'||(0)||'}([^,]*)', 1, 1, '', 2);
v_type := REGEXP_SUBSTR(v_line, '(.*?,){'||(1)||'}([^,]*)', 1, 1, '', 2);
v_date := REGEXP_SUBSTR(v_line, '(.*?,){'||(2)||'}([^,]*)', 1, 1, '', 2);
v_val1 := REGEXP_SUBSTR(v_line, '(.*?,){'||(v_val1_index)||'}([^,]*)', 1, 1, '', 2);
v_val2 := REGEXP_SUBSTR(v_line, '(.*?,){'||(v_val2_index)||'}([^,]*)', 1, 1, '', 2);

The function parses a 100 MB file in about 30 seconds, so execution time isn't terrible but I want to verify if there is room for improvement. Therefore, my question is two part. Instead of using REGEXP_SUBSTR, is there a way to break up the line into a custom type and then reference the pieces of the line by index? If so, will that make a noticeable performance improvement?

Upvotes: 0

Views: 33

Answers (1)

MT0
MT0

Reputation: 167962

You can loop through each line and insert it into a table without converting the current format to a CSV format:

DECLARE
  v_data CLOB;
  v_line_start PLS_INTEGER := 1;
  v_line_end   PLS_INTEGER;
  v_line       VARCHAR2(32000);
  c_pattern    CONSTANT VARCHAR2(100) :=
                 '^(\d+)\s+(\S+)\s+(\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}:\d{2})'
                 || '\s+(\d+).*?(\d+)$';
BEGIN
  -- Read in your file.
  v_data := 'ID  TYPE DATE                VALUE1   VALUE2
1   A    3-1-2021 12:00:00   10       24
1   A    3-2-2021 12:00:00   25       25
1   A    3-3-2021 12:00:00   31       36
ID  TYPE DATE                VALUE1   UNNECESSARY_VALUE VALUE2
2   B    3-1-2021 12:00:00   10       24                14
2   B    3-2-2021 12:00:00   25       25                22
2   B    3-3-2021 12:00:00   31       36                12
ID  TYPE DATE                VALUE1   VALUE2
3   A    3-1-2021 12:00:00   10       24
3   A    3-2-2021 12:00:00   25       25
3   A    3-3-2021 12:00:00   31       36
ID  TYPE DATE                VALUE1   UNNECESSARY_VALUE1 UNNECESSARY_VALUE2 VALUE2
4   C    3-1-2021 12:00:00   10       21                 22                 38
4   C    3-2-2021 12:00:00   25       21                 42                 23
4   C    3-3-2021 12:00:00   31       22                 32                 34';

  WHILE v_line_start > 0 LOOP
    v_line_end := INSTR( v_data, CHR(10), v_line_start );
    IF v_line_end > 0 THEN
      v_line := DBMS_LOB.SUBSTR( v_data, v_line_end - v_line_start, v_line_start );
      v_line_start := v_line_end + 1;
    ELSE
      v_line := DBMS_LOB.SUBSTR( v_data, LENGTH( v_data ) - v_line_start + 1, v_line_start );
      v_line_start := 0;
    END IF;
    IF v_line LIKE 'ID  TYPE DATE %' THEN
      CONTINUE;
    END IF;


    INSERT INTO data (
      id,
      typ,
      dt,
      value1,
      value2
    ) VALUES (
      TO_NUMBER( REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 1 ) ),
      REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 2 ),
      TO_DATE(
        REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 3 ),
        'MM-DD-YYYY HH24:MI:SS'
      ),
      TO_NUMBER( REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 4 ) ),
      TO_NUMBER( REGEXP_SUBSTR( v_line, c_pattern, 1, 1, NULL, 5 ) )
    );
  END LOOP;
END;
/

Which, given the table:

CREATE TABLE data (
  id     NUMBER,
  typ    VARCHAR2(10),
  dt     DATE,
  value1 NUMBER,
  value2 NUMBER
);

Then:

SELECT *
FROM   data;

Outputs:

ID | TYP | DT                  | VALUE1 | VALUE2
-: | :-- | :------------------ | -----: | -----:
 1 | A   | 2021-03-01 12:00:00 |     10 |     24
 1 | A   | 2021-03-02 12:00:00 |     25 |     25
 1 | A   | 2021-03-03 12:00:00 |     31 |     36
 2 | B   | 2021-03-01 12:00:00 |     10 |     14
 2 | B   | 2021-03-02 12:00:00 |     25 |     22
 2 | B   | 2021-03-03 12:00:00 |     31 |     12
 3 | A   | 2021-03-01 12:00:00 |     10 |     24
 3 | A   | 2021-03-02 12:00:00 |     25 |     25
 3 | A   | 2021-03-03 12:00:00 |     31 |     36
 4 | C   | 2021-03-01 12:00:00 |     10 |     38
 4 | C   | 2021-03-02 12:00:00 |     25 |     23
 4 | C   | 2021-03-03 12:00:00 |     31 |     34

db<>fiddle here

Or, just using SUBSTR:

DECLARE
  v_data CLOB;
  v_line_start PLS_INTEGER := 1;
  v_line_end   PLS_INTEGER;
  v_line       VARCHAR2(32000);
  v_v2_pos     PLS_INTEGER;
BEGIN
  -- Read in your file.
  v_data := 'ID  TYPE DATE                VALUE1   VALUE2
1   A    3-1-2021 12:00:00   10       24
1   A    3-2-2021 12:00:00   25       25
1   A    3-3-2021 12:00:00   31       36
ID  TYPE DATE                VALUE1   UNNECESSARY_VALUE VALUE2
2   B    3-1-2021 12:00:00   10       24                14
2   B    3-2-2021 12:00:00   25       25                22
2   B    3-3-2021 12:00:00   31       36                12
ID  TYPE DATE                VALUE1   VALUE2
3   A    3-1-2021 12:00:00   10       24
3   A    3-2-2021 12:00:00   25       25
3   A    3-3-2021 12:00:00   31       36
ID  TYPE DATE                VALUE1   UNNECESSARY_VALUE1 UNNECESSARY_VALUE2 VALUE2
4   C    3-1-2021 12:00:00   10       21                 22                 38
4   C    3-2-2021 12:00:00   25       21                 42                 23
4   C    3-3-2021 12:00:00   31       22                 32                 34';

  WHILE v_line_start > 0 LOOP
    v_line_end := INSTR( v_data, CHR(10), v_line_start );
    IF v_line_end > 0 THEN
      v_line := DBMS_LOB.SUBSTR( v_data, v_line_end - v_line_start, v_line_start );
      v_line_start := v_line_end + 1;
    ELSE
      v_line := DBMS_LOB.SUBSTR( v_data, LENGTH( v_data ) - v_line_start + 1, v_line_start );
      v_line_start := 0;
    END IF;
    IF v_line LIKE 'ID  TYPE DATE %' THEN
      v_v2_pos := INSTR( v_line, ' ', -1 ) + 1;
      CONTINUE;
    END IF;

    INSERT INTO data (
      id,
      typ,
      dt,
      value1,
      value2
    ) VALUES (
      TO_NUMBER( SUBSTR( v_line, 1, 4 ) ),
      RTRIM( SUBSTR( v_line, 5, 5 ) ),
      TO_DATE( SUBSTR( v_line, 10, 18 ), 'MM-DD-YYYY HH24:MI:SS' ),
      TO_NUMBER( SUBSTR( v_line, 30, 9 ) ),
      TO_NUMBER( SUBSTR( v_line, v_v2_pos ) )
    );
  END LOOP;
END;
/

db<>fiddle here

Upvotes: 1

Related Questions