Philip
Philip

Reputation: 105

Array 1 contained in array 2 and elements in the same order

Does PostgreSQL have some way for me to find if an array is contained within another array, but with the same ordering?
For example, I want to know if array1 is within array2 with matching elements in the same order.

array1[1, 3, 6, 8]
array2[3, 8, 2, 9, 10, 1, 6]

Obviously not the case in the example, but is there a built-in method for this in PostgreSQL or should I create my own function?

Version of PostgreSQL is 9.6. The actual numbers the query will run on are bigints.

Upvotes: 2

Views: 176

Answers (1)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656952

General case

All elements of the second array are in the first, too. In the same order, but gaps are allowed.

I suggest this polymorphic PL/pgSQL function:

CREATE OR REPLACE FUNCTION array_contains_array_in_order(arr1 ANYARRAY
                                                       , arr2 ANYARRAY
                                                       , elem ANYELEMENT = NULL)
  RETURNS bool AS
$func$
DECLARE
   pos int := 1;
BEGIN
   FOREACH elem in ARRAY arr2
   LOOP
      pos := pos + array_position(arr1[pos:], elem);  -- see below
      IF pos IS NULL THEN
         RETURN FALSE;
      END IF;
   END LOOP; 

   RETURN true; --  all elements found in order
END
$func$ LANGUAGE plpgsql IMMUTABLE COST 3000;

As @a_horse commented, we can omit the upper bound in array subscripts to mean "unbounded" (arr1[pos:]). In older versions before 9.6 substitute with arr1[pos:2147483647] - 2147483647 = 2^31 - 1 being the theoretical max array index, the greatest signed int4 number.

This works for ...

About the ANYELEMENT trick:

Performance

I ran a quick performance test comparing this function to the one @a_horse supplied. This one was around 5x faster.

If you use this a filter for a big table I strongly suggest you combine it with a (logically redundant) sargable filter like:

SELECT *
FROM   tbl
WHERE  arr @> '{2,160,134,58,149,111}'::int[] 
AND    array_contains_array_in_order(arr, '{2,160,134,58,149,111}')  

This will use a GIN index on the array column like:

CREATE INDEX ON tbl USING gin (arr);

And only filter remaining (typically very few!) arrays that share all elements. Typically much faster.

Caveats with intarray module

Note: applies to integer[] exclusively, not smallint[] or bigint[] or any other array type!

Careful, if you have installed the intarray extension, which provides its own variant of the @> operator for int[]. Either you create an (additional) GIN index with its special operator class (which is a bit faster where applicable):

CREATE INDEX ON intarr USING gin (arr gin__int_ops);

Or, while you only have a GIN index with the default operator class, you must explicitly denote the standard operator to cooperate with the index:

WHERE  arr OPERATOR(pg_catalog.@>) '{2,160,134,58,149,111}'::int[] 

Details:

Simple case

As commented, your case is simpler:

The complete second array is included in the first (same order, no gaps!).

CREATE OR REPLACE FUNCTION array_contains_array_exactly(arr1 ANYARRAY, arr2 ANYARRAY)
  RETURNS bool AS
$func$
DECLARE
   len int := array_length(arr2, 1) - 1;  -- length of arr2 - 1 to fix off-by-1
   pos int;                               -- for current search postition in arr1
BEGIN
   /*                                     -- OPTIONAL, if invalid input possible
   CASE array_length(arr1, 1) >  len      -- array_length(arr2, 1) - 1 
   WHEN TRUE  THEN                        -- valid arrays
      -- do nothing, proceed
   WHEN FALSE THEN                        -- arr1 shorter than arr2
      RETURN FALSE;                       -- or raise exception?
   ELSE                                   -- at least one array empty or NULL
      RETURN NULL;
   END CASE;
   */

   pos := array_position(arr1, arr2[1]);  -- pos of arr2's 1st elem in arr1

   WHILE pos IS NOT NULL
   LOOP
      IF arr1[pos:pos+len] = arr2 THEN    -- array slice matches arr2 *exactly*
         RETURN TRUE;                     -- arr2 is part of arr1
      END IF;

      pos := pos + array_position(arr1[(pos+1):], arr2[1]);
   END LOOP; 

   RETURN FALSE;
END
$func$ LANGUAGE plpgsql IMMUTABLE COST 1000;

Considerably faster than the above for longer arrays. All other considerations still apply.

Upvotes: 2

Related Questions