C++ substring matching implementation

Question

I have two strings, "hello" an "eo" for instance, and I wish to find duplicate characaters between the two strings, that is: 'e' and 'o' in this example.

My algorithm would go this way

 void find_duplicate(char* str_1, char* str_2, int len1, int len2)
 {
     char c ;

     if(len1 < len2)
     {
        int* idx_1 = new int[len1]; // record elements in little string
        // that are matched in big string
        for(int k = 0 ; k < len1 ; k++)
              idx_1[k] = 0;

        int* idx_2 = new int[len2]; // record if element in str_2 has been 
        // matched already or not
        for(int k = 0 ; k < len2 ; k++)
              idx_2[k] = 0;

        for(int i = 0 ; i < len2 ; i++)
        {     
            c = str_1[i];

            for(int j = 0 ; j < len1 ; j++)
            {
                 if(str_2[j] == c)
                 {
                    if(idx_2[j] == 0) // this element in str_2 has not been matched yet
                    {
                         idx_1[i] = j + 1; // mark ith element in idx as matched in string 2 at pos j
                         idx_2[j] = 1;
                    }
                 }
             }
         }

         // now idx_1 and idx_2 contain matches info, let's remove matches.
         char* str_1_new = new char[len1];
         char* str_2_new = new char[len2];
         int kn = 0;
         for(int k = 0 ; k < len1 ; k++)
         {
            if(idx_1[k] > 0)
            {
                 str_1_new[kn] = str_1[k];
                 kn++;
            }
         }

         kn = 0;
         for(int k = 0 ; k < len2 ; k++)
         {
             if(idx_2[k] > 0)
             {
                 str_2_new[kn] = str_2[k];
                 kn++;
             }
         }
      }
      else
      {
            // same here, switching roles (do it yourself)
       }
  }

i feel my solution is awkward: - symetry of both cases in first if/else and code duplication - time complexity: 2*len1*len2 operations for finding duplicates, then len1 + len2 operations for removal - space complexity: two len1 and two len2 char*.

What if len1 and len2 are not given (with and without resort to STL vector) ?

could you provide your implementation of this algo ?

thanks

Zaroth · Accepted Answer

First of all, it isn't substring matching problem - it is problem of finding common characters between two strings.

Your solution works in O(n*m), where n=len1 and m=len2 in your code. You could easily solve the same problem in O(n+m+c) time by counting characters in each of strings (where c is equal to size of character set). This algorithm is called counting sort.

Sample code implementing this in your case:

#include 
#include  // for strlen and memset

const int CHARLEN = 256; //number of possible chars

using namespace std;

// returns table of char duplicates
char* find_duplicates(const char* str_1, const char* str_2, const int len1, const int len2)
{
  int *count_1 = new int[CHARLEN];
  int *count_2 = new int[CHARLEN];
  char *duplicates = new char[CHARLEN+1]; // we hold duplicate chars here
  int dupl_len = 0; // length of duplicates table, we insert '\0' at the end
  memset(count_1,0,sizeof(int)*CHARLEN);
  memset(count_2,0,sizeof(int)*CHARLEN);
  for (int i=0; i 0 && count_2[i] > 0)
    {
      duplicates[dupl_len] = i;
      ++dupl_len;
    }
  }
  duplicates[dupl_len]='\0';
  delete count_1;
  delete count_2;
  return duplicates;
}

int main()
{
  const char* str_1 = "foobar";
  const char* str_2 = "xro";
  char* dup =   find_duplicates(str_1, str_2, strlen(str_1), strlen(str_2));
  cout << "str_1: "" << str_1 << "" str_2: "" << str_2 << ""
";
  cout << "duplicates: "" << dup << ""
";
  delete dup;
  return 0;
}

Please note that I am also sorting the output here. If you do not want to do that, you can skip the counting of characters in second string and just start comparing duplicates on the go.

If you, however, intend to be able to detect multiple duplicates of the same letter (e.g. if "banana" and "arena" should output "aan" instead of "an"), then you can just substract the number of counts in current solution and adjust the output accordingly.

C++ substring matching implementation

Answers (2)

Related Questions