Ankur
Ankur

Reputation: 3724

Find whether a 2d matrix is subset of another 2d matrix

Recently i was taking part in one Hackathon and i came to know about a problem which tries to find a pattern of a grid form in a 2d matrix.A pattern could be U,H and T and will be represented by 3*3 matrix suppose if i want to present H and U

+--+--+--+            +--+--+--+
|1 |0 |1 |            |1 |0 |1 |
+--+--+--+            +--+--+--+
|1 |1 |1 |  --> H     |1 |0 |1 |    -> U
+--+--+--+            +--+--+--+
|1 |0 |1 |            |1 |1 |1 |
+--+--+--+            +--+--+--+

Now i need to search this into 10*10 matrix containing 0s and 1s.Closest and only solution i can get it brute force algorithm of O(n^4).In languages like MATLAB and R there are very subtle ways to do this but not in C,C++. I tried a lot to search this solution on Google and on SO.But closest i can get is this SO POST which discuss about implementing Rabin-Karp string-search algorithm .But there is no pseudocode or any post explaining this.Could anyone help or provide any link,pdf or some logic to simplify this?

EDIT

as Eugene Sh. commented that If N is the size of the large matrix(NxN) and k - the small one (kxk), the buteforce algorithm should take O((Nk)^2). Since k is fixed, it is reducing to O(N^2).Yes absolutely right. But is there is any generalised way if N and K is big?

Upvotes: 12

Views: 7717

Answers (3)

mrk
mrk

Reputation: 3191

I'm going to present an algorithm that takes O(n*n) time in the worst case whenever k = O(sqrt(n)) and O(n*n + n*k*k) in general. This is an extension of Aho-Corasick to 2D. Recall that Aho-Corasick locates all occurrences of a set of patterns in a target string T, and it does so in time linear in pattern lengths, length of T, and number of occurrences.

Let's introduce some terminology. The haystack is the large matrix we are searching in and the needle is the pattern-matrix. The haystack is a nxn matrix and the needle is a kxk matrix. The set of patterns that we are going to use in Aho-Corasick is the set of rows of the needle. This set contains at most k rows and will have fewer if there are duplicate rows.

We are going to build the Aho-Corasick automaton (which is a Trie augmented with failure links) and then run the search algorithm on each row of the haystack. So we take every row of the needle and search for it in every row of the haystack. We can use a linear time 1D matching algorithm to do this but that would still be inefficient. The advantage of Aho-Corasick is that it searches for all patterns at once.

During the search we are going to populate a matrix A which we are going to use later. When we search in the first row of the haystack, the first row of A is filled with the occurrences of rows of the needle in the first row of the haystack. So we'll end up with a first row of A that looks like 2 - 0 - - 1 for example. This means that row number 0 of the needle appears at position 2 in the first row of the haystack; row number 1 appears at position 5; row number 2 appears at position 0. The - entries are positions that did not get matched. Keep doing this for every row.

Let's for now assume that there are no duplicate rows in the needle. Assign 0 to the first row of the needle, 1 to the second, and so on. Now we are going to search for the pattern [0 1 2 ... k-1] in every column of the matrix A using a linear time 1D search algorithm (KMP for example). Recall that every row of A stores positions at which rows of the needle appear. So if a column contains the pattern [0 1 2 ... k-1], this means that row number 0 of the needle appears at some row of the haystack, row number 1 of the needle is just below it, and so on. This is exactly what we want. If there are duplicate rows, just assign a unique number to each unique row.

Search in column takes O(n) using a linear time algorithm. So searching all columns takes O(n*n). We populate the matrix during the search, we search every row of the haystack (there are n rows) and the search in a row takes O(n+k*k). So O(n(n+k*k)) overall.

So the idea was to find that matrix and then reduce the problem to 1D pattern matching. Aho-Corasick is just there for efficiency, I don't know if there is another efficient way to find the matrix.

EDIT: added implementation.

Here is my c++ implementation. The max value of n is set to 100 but you can change it.

The program starts by reading two integers n k (the dimensions of the matrices). Then it reads n lines each containing a string of 0's and 1's of length n. Then it reads k lines each containing a string of 0's and 1's of length k. The output is the upper-left coordinate of all matches. For the following input for example.

12 2
101110111011
111010111011
110110111011
101110111010
101110111010
101110111010
101110111010
111010111011
111010111011
111010111011
111010111011
111010111011
11
10

The program will output:

match at (2,0)
match at (1,1)
match at (0,2)
match at (6,2)
match at (2,10)

#include <cstdio>
#include <cstring>
#include <string>
#include <queue>
#include <iostream>

using namespace std;

const int N = 100;
const int M = N;
int n, m;

string haystack[N], needle[M];
int A[N][N]; /* filled by successive calls to match */
int p[N]; /* pattern to search for in columns of A */

struct Node
{   Node *a[2]; /* alphabet is binary */
    Node *suff; /* pointer to node whose prefix = longest proper suffix of this node */
    int flag;

    Node()
    {   a[0] = a[1] = 0;
        suff = 0;
        flag = -1;
    }
};

void insert(Node *x, string s)
{   static int id = 0;
    static int p_size = 0;

    for(int i = 0; i < s.size(); i++)
    {   char c = s[i];
        if(x->a[c - '0'] == 0)
            x->a[c - '0'] = new Node;
        x = x->a[c - '0'];
    }
    if(x->flag == -1)
        x->flag = id++;

    /* update pattern */
    p[p_size++] = x->flag;
}

Node *longest_suffix(Node *x, int c)
{   while(x->a[c] == 0)
        x = x->suff;
    return x->a[c];
}

Node *mk_automaton(void)
{   Node *trie = new Node;
    for(int i = 0; i < m; i++)
    {   insert(trie, needle[i]);
    }

    queue<Node*> q;

    /* level 1 */
    for(int i = 0; i < 2; i++)
    {   if(trie->a[i])
        {   trie->a[i]->suff = trie;
            q.push(trie->a[i]);
        }
        else trie->a[i] = trie;
    }

    /* level > 1 */
    while(q.empty() == false)
    {   Node *x = q.front(); q.pop();
        for(int i = 0; i < 2; i++)
        {   if(x->a[i] == 0) continue;
            x->a[i]->suff = longest_suffix(x->suff, i);
            q.push(x->a[i]);
        }
    }

    return trie;
}

/* search for patterns in haystack[j] */
void match(Node *x, int j)
{   for(int i = 0; i < n; i++)
    {   x = longest_suffix(x, haystack[j][i] - '0');
        if(x->flag != -1)
        {   A[j][i-m+1] = x->flag;
        }
    }
}

int match2d(Node *x)
{   int matches = 0;
    static int z[M+N];
    static int z_str[M+N+1];

    /* init */
    memset(A, -1, sizeof(A));

    /* fill the A matrix */
    for(int i = 0; i < n; i++)
    {   match(x, i);
    }

    /* build string for z algorithm */
    z_str[n+m] = -2; /* acts like `\0` for strings */
    for(int i = 0; i < m; i++)
    {   z_str[i] = p[i];
    }

    for(int i = 0; i < n; i++)
    {   /* search for pattern in column i */
        for(int j = 0; j < n; j++)
        {   z_str[j + m] = A[j][i];
        }

        /* run z algorithm */
        int l, r;
        l = r = 0;
        z[0] = n + m;
        for(int j = 1; j < n + m; j++)
        {   if(j > r)
            {   l = r = j;
                while(z_str[r] == z_str[r - l]) r++;
                z[j] = r - l;
                r--;
            }
            else
            {   if(z[j - l] < r - j + 1)
                {   z[j] = z[j - l];
                }
                else
                {   l = j;
                    while(z_str[r] == z_str[r - l]) r++;
                    z[j] = r - l;
                    r--;
                }
            }
        }

        /* locate matches */
        for(int j = m; j < n + m; j++)
        {   if(z[j] >= m)
            {   printf("match at (%d,%d)\n", j - m, i);
                matches++;
            }
        }
    }

    return matches;
}

int main(void)
{   cin >> n >> m;
    for(int i = 0; i < n; i++)
    {   cin >> haystack[i];
    }
    for(int i = 0; i < m; i++)
    {   cin >> needle[i];
    }

    Node *trie = mk_automaton();
    match2d(trie);

    return 0;
}

Upvotes: 3

5gon12eder
5gon12eder

Reputation: 25449

Alright, here is then the 2D Rabin-Karp approach.

For the following discussion, assume we want to find a (m, m) sub-matrix inside a (n, n) matrix. (The concept works for rectangular matrices just as well but I ran out of indices.)

The idea is that for each possible sub-matrix, we compute a hash. Only if that hash matches the hash of the matrix we want to find, we will compare element-wise.

To make this efficient, we must avoid re-computing the entire hash of the sub-matrix each time. Because I got little sleep tonight, the only hash function for which I could figure out how to do this easily is the sum of 1s in the respective sub-matrix. I leave it as an exercise to someone smarter than me to figure out a better rolling hash function.

Now, if we have just checked the sub-matrix from (i, j) to (i + m – 1, j + m – 1) and know it has x 1s inside, we can compute the number of 1s in the sub-matrix one to the right – that is, from (i, j + 1) to (i + m – 1, j + m) – by subtracting the number of 1s in the sub-vector from (i, j) to (i + m – 1, j) and adding the number of 1s in the sub-vector from (i, j + m) to (i + m – 1, j + m).

If we hit the right margin of the large matrix, we shift the window down by one and then back to the left margin and then again down by one and then again to the right and so forth.

Note that this requires O(m) operations, not O(m2) for each candidate. If we do this for every pair of indices, we get O(mn2) work. Thus, by cleverly shifting a window of the size of the potential sub-matrix through the large matrix, we can reduce the amount of work by a factor of m. That is, if we don't get too many hash collisions.

Here is a picture:

The rolling hash function for the window shifted to the right.

As we shift the current window one to the right, we subtract the number of 1s in the red column vector on the left side and add the number of 1s in the green column vector on the right side to obtain the number of 1s in the new window.

I have implemented a quick demo of this idea using the great Eigen C++ template library. The example also uses some stuff from Boost but only for argument parsing and output formatting so you can easily get rid of it if you don't have Boost but want to try out the code. The index fiddling is a bit messy but I'll leave it without further explanation here. The above prose should cover it sufficiently.

#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <random>
#include <type_traits>
#include <utility>

#include <boost/format.hpp>
#include <boost/lexical_cast.hpp>

#include <Eigen/Dense>

#define PROGRAM "submatrix"
#define SEED_CSTDLIB_RAND 1

using BitMatrix = Eigen::Matrix<bool, Eigen::Dynamic, Eigen::Dynamic>;
using Index1D = BitMatrix::Index;
using Index2D = std::pair<Index1D, Index1D>;

std::ostream&
operator<<(std::ostream& out, const Index2D& idx)
{
  out << "(" << idx.first << ", " << idx.second << ")";
  return out;
}

BitMatrix
get_random_bit_matrix(const Index1D rows, const Index1D cols)
{
  auto matrix = BitMatrix {rows, cols};
  matrix.setRandom();
  return matrix;
}

Index2D
findSubMatrix(const BitMatrix& haystack,
              const BitMatrix& needle,
              Index1D *const collisions_ptr = nullptr) noexcept
{
  static_assert(std::is_signed<Index1D>::value, "unsigned index type");
  const auto end = Index2D {haystack.rows(), haystack.cols()};
  const auto hr = haystack.rows();
  const auto hc = haystack.cols();
  const auto nr = needle.rows();
  const auto nc = needle.cols();
  if (nr > hr || nr > hc)
    return end;
  const auto target = needle.count();
  auto current = haystack.block(0, 0, nr - 1, nc).count();
  auto j = Index1D {0};
  for (auto i = Index1D {0}; i <= hr - nr; ++i)
    {
      if (j == 0)  // at left margin
        current += haystack.block(i + nr - 1, 0, 1, nc).count();
      else if (j == hc - nc)  // at right margin
        current += haystack.block(i + nr - 1, hc - nc, 1, nc).count();
      else
        assert(!"this should never happen");
      while (true)
        {
          if (i % 2 == 0)  // moving right
            {
              if (j > 0)
                current += haystack.block(i, j + nc - 1, nr, 1).count();
            }
          else  // moving left
            {
              if (j < hc - nc)
                current += haystack.block(i, j, nr, 1).count();
            }
          assert(haystack.block(i, j, nr, nc).count() == current);
          if (current == target)
            {
              // TODO: There must be a better way than using cwiseEqual().
              if (haystack.block(i, j, nr, nc).cwiseEqual(needle).all())
                return Index2D {i, j};
              else if (collisions_ptr)
                *collisions_ptr += 1;
            }
          if (i % 2 == 0)  // moving right
            {
              if (j < hc - nc)
                {
                  current -= haystack.block(i, j, nr, 1).count();
                  ++j;
                }
              else break;
            }
          else  // moving left
            {
              if (j > 0)
                {
                  current -= haystack.block(i, j + nc - 1, nr, 1).count();
                  --j;
                }
              else break;
            }
        }
      if (i % 2 == 0)  // at right margin
        current -= haystack.block(i, hc - nc, 1, nc).count();
      else  // at left margin
        current -= haystack.block(i, 0, 1, nc).count();
    }
  return end;
}

int
main(int argc, char * * argv)
{
  if (SEED_CSTDLIB_RAND)
    {
      std::random_device rnddev {};
      srand(rnddev());
    }
  if (argc != 5)
    {
      std::cerr << "usage: " << PROGRAM
                << " ROWS_HAYSTACK COLUMNS_HAYSTACK"
                << " ROWS_NEEDLE COLUMNS_NEEDLE"
                << std::endl;
      return EXIT_FAILURE;
    }
  auto hr = boost::lexical_cast<Index1D>(argv[1]);
  auto hc = boost::lexical_cast<Index1D>(argv[2]);
  auto nr = boost::lexical_cast<Index1D>(argv[3]);
  auto nc = boost::lexical_cast<Index1D>(argv[4]);
  const auto haystack = get_random_bit_matrix(hr, hc);
  const auto needle = get_random_bit_matrix(nr, nc);
  auto collisions = Index1D {};
  const auto idx = findSubMatrix(haystack, needle, &collisions);
  const auto end = Index2D {haystack.rows(), haystack.cols()};
  std::cout << "This is the haystack:\n\n" << haystack << "\n\n";
  std::cout << "This is the needle:\n\n" << needle << "\n\n";
  if (idx != end)
    std::cout << "Found as sub-matrix at " << idx << ".\n";
  else
    std::cout << "Not found as sub-matrix.\n";
  std::cout << boost::format("There were %d (%.2f %%) hash collisions.\n")
    % collisions
    % (100.0 * collisions / ((hr - nr) * (hc - nc)));
  return (idx != end) ? EXIT_SUCCESS : EXIT_FAILURE;
}

While it compiles and runs, please consider the above as pseudo-code. I have made almost no attempt at optimizing it. It was just a proof-of concept for myself.

Upvotes: 11

kraskevich
kraskevich

Reputation: 18566

Let's start with an O(N * N * K) solution. I will use the following notation: A is a pattern matrix, B is a big matrix(the one we will search for occurrences of the pattern in).

  1. We can fix a top row of the B matrix(that is, we will search for all occurrence that start in a position (this row, any column). Let's call this row a topRow. Now we can take a slice of this matrix that contains [topRow; topRow + K) rows and all columns.

  2. Let's create a new matrix as a result of concatenation A + column + the slice, where a column is a column with K elements that are not present in A or B(if A and B consist of 0 and 1, we can use -1, for instance). Now we can treat columns of this new matrix as letters and run the Knuth-Morris-Pratt's algorithm. Comparing two letters requires O(K) time, thus the time complexity of this step is O(N * K).

There are O(N) ways to fix the top row, so the total time complexity is O(N * N * K). It is already better than a brute-force solution, but we are not done yet. The theoretical lower bound is O(N * N)(I assume that N >= K), and I want to achieve it.

Let's take a look at what can be improved here. If we could compare two columns of a matrix in O(1) time instead of O(k), we would have achieved the desired time complexity. Let's concatenate all columns of both A and B inserting some separator after each column. Now we have a string and we need to compare its substrings(because columns and their parts are substrings now). Let's construct a suffix tree in linear time(using Ukkonnen's algorithm). Now comparing two substrings is all about finding the height of the lowest common ancestor(LCA) of two nodes in this tree. There is an algorithm that allows us to do it with linear preprocessing time and O(1) time per LCA query. It means that we can compare two substrings(or columns) in constant time! Thus, the total time complexity is O(N * N). There is another way to achieve this time complexity: we can build a suffix array in linear time and answer the longest common prefix queries in constant time(with a linear time preprocessing). However, both of this O(N * N) solutions look pretty hard to implement and they will have a big constant.

P.S If we have a polynomial hash function that we can fully trust(or we are fine with a few false positives), we can get a much simpler O(N * N) solution using 2-D polynomial hashes.

Upvotes: 2

Related Questions