Reputation: 839

Extract text after Nth occurance

I have several strings like this:

1:723:NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor

I need to return everything after the second colon (:). So, from the above string I need:

NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor

Seeing lots of examples, but none are really working for this task.

Upvotes: 1

Answers (4)

JohnH

Reputation: 3135

The following demonstrates using NULLIF and REGEXP_REPLACE with the regular expression ^(([^:]*:){2}(.*))|(.*) to extract the portion of a string that occurs after a second :, or NULL if there are no characters after the second ::

WITH
  test_cases (id, test_value, expected_result) AS (
    VALUES
      (1, NULL, NULL),
      (2, '', NULL),
      (3, '1:', NULL),
      (4, '1:2:', NULL),
      (5, '1:2: ', ' '),
      (6,
       '1:723:NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor',
        'NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor'
      )
  ),
  runs AS (
    SELECT
      tc.id,
      tc.test_value,
      tc.expected_result,
      NULLIF(
        REGEXP_REPLACE(tc.test_value, '^(([^:]*:){2}(.*))|(.*)', '\3'),
        ''
      ) AS actual_result
    FROM
      test_cases tc
  )
SELECT
  runs.id,
  runs.expected_result IS NOT DISTINCT FROM runs.actual_result AS passed,
  runs.actual_result
FROM
  runs
ORDER BY
  runs.id;

The following script measures the overhead associated with generating test strings (test expression s.string||gs.n::TEXT) and the execution time for each of four expressions that return the portion of the string that occurs after the second ::

DROP TABLE IF EXISTS test_runs;

CREATE TEMPORARY TABLE test_runs (
  id INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  test_case INTEGER,
  test_expression TEXT,
  execution_time_ms NUMERIC
);

DO language plpgsql $BLOCK$
DECLARE
  number_of_iterations CONSTANT INTEGER := 100;
  number_of_rows CONSTANT INTEGER := 10000;
  test_expressions CONSTANT TEXT[] := ARRAY[
    $$s.string||gs.n::TEXT$$,
    $$SUBSTRING(s.string||gs.n::TEXT, '^.+?:.+?:(.*$)')$$,
    $$ARRAY_TO_STRING((STRING_TO_ARRAY(s.string||gs.n::TEXT, ':'))[3:], ':')$$,
    $$ARRAY_TO_STRING((REGEXP_SPLIT_TO_ARRAY(s.string||gs.n::TEXT, ':'))[3:], ':')$$,
    $$REGEXP_REPLACE(s.string||gs.n::TEXT, '^(([^:]*:){2}(.*))|(.*)', '\3')$$
  ];
  iteration INTEGER;
  results JSON;
  test_case INTEGER;
  test_run INTEGER := 0;
BEGIN
  FOR iteration IN 1 .. number_of_iterations LOOP
    FOR test_case IN 1 .. CARDINALITY(test_expressions) LOOP
      EXECUTE
        FORMAT($$
          EXPLAIN (ANALYZE, FORMAT JSON)
          SELECT %1$s
            FROM
              (VALUES ('1:723:NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor')) s(string)
              CROSS JOIN GENERATE_SERIES(1, $2) gs(n);$$,
          test_expressions[test_case])
        INTO results
        USING test_case, number_of_rows;
      INSERT INTO test_runs(test_case, test_expression, execution_time_ms)
      VALUES (test_case, test_expressions[test_case], (results -> 0 ->> 'Execution Time')::NUMERIC);
    END LOOP;
  END LOOP;
END;
$BLOCK$;

SELECT AVG(execution_time_ms)::NUMERIC(18,3) AS mean_execution_time_ms, test_expression
  FROM test_runs
  GROUP BY test_case, test_expression
  ORDER BY mean_execution_time_ms, test_case;

mean_execution_time_ms	test_expression
3.486	s.string\|\|gs.n::TEXT
24.911	ARRAY_TO_STRING((STRING_TO_ARRAY(s.string\|\|gs.n::TEXT, ':'))[3:], ':')
41.494	SUBSTRING(s.string\|\|gs.n::TEXT, '^.+?:.+?:(.*$)')
61.743	REGEXP_REPLACE(s.string\|\|gs.n::TEXT, '^(([^:]:){2}(.))\|(.*)', '\3')
87.302	ARRAY_TO_STRING((REGEXP_SPLIT_TO_ARRAY(s.string\|\|gs.n::TEXT, ':'))[3:], ':')

Upvotes: 1

Erwin Brandstetter

Reputation: 658072

Shortest

substring() with a regular expression:

SELECT substring(string, '^.+?:.+?:(.*$)');

The regexp explained:

^ ... anchor to start of string (optional)
. ... any character
+? ... 1 or more times, non-greedy
: ... the literal character
(2 times)
(.*$) ... any characters 0 or more times, greedy; in capturing parentheses

test=> SELECT substring('1:723:NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor', '^.+?:.+?:(.*$)');
                                     substring                                     
-----------------------------------------------------------------------------------
 NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor
(1 row)

Fastest

If performance matters ...

The above is simple and short, so it performs decently. But regular expressions are expensive. Even (costly) converting to an array and back like Adrian demonstrates is typically faster.

This is a couple times faster than anything suggested so far:

SELECT substring(string, length(split_part(string, ':', 1))
                       + length(split_part(string, ':', 2))
                       + 3);

Upvotes: 1

Adrian Klaver

Reputation: 19684

select array_to_string((string_to_array('1:723:NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor', ':'))[3:], ':');

NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor

This uses string_to_array from here:

https://www.postgresql.org/docs/current/functions-string.html

string_to_array ( string text, delimiter text [, null_string text ] ) → text[]

Splits the string at occurrences of delimiter and forms the resulting fields into a text array. If delimiter is NULL, each character in the string will become a separate element in the array. If delimiter is an empty string, then the string is treated as a single field. If null_string is supplied and is not NULL, fields matching that string are replaced by NULL. See also array_to_string.

string_to_array('xx~~yy~~zz', '~~', 'yy') → {xx,NULL,zz}

This breaks the string down into it's component parts and makes an array out of them. The [3:] selects the third through the end of the array elements from the array.

Then from here:

https://www.postgresql.org/docs/current/functions-array.html

array_to_string ( array anyarray, delimiter text [, null_string text ] ) → text

Converts each array element to its text representation, and concatenates those separated by the delimiter string. If null_string is given and is not NULL, then NULL array entries are represented by that string; otherwise, they are omitted. See also string_to_array.

array_to_string(ARRAY[1, 2, 3, NULL, 5], ',', '') → 1,2,3,,5

This reconstitutes the string by concatenating the array elements with the delimiter.

Upvotes: 1

Frank Heikens

Reputation: 127367

This one returns an array, by splitting the content on the :

SELECT (regexp_split_to_array('1:723:NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor',':'))[3:];

If you need a single string:

SELECT array_to_string((regexp_split_to_array('1:723:NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor',':'))[3:], ':');

This one uses a regex to split the content and also returns an array, but just one element and that's easy to convert to a single string:

SELECT (regexp_matches('1:723:NVDA:NSDQ::Z4189274321:Snapshot Report:10:COMPUTER & TECHNOLOGY:241:Semiconductor', '^[^:]*:[^:]*:(.*)', 'g'))[1];

Upvotes: 0

Extract text after Nth occurance

Answers (4)

Shortest

Fastest

Related Questions