Sailing Judo
Sailing Judo

Reputation: 11233

regular expression for finding parts of a string within another

I have two strings: the first's value is "catdog" and the second's is "got".

I'm trying to find a regex that tells me if the letters for "got" are in "catdog". I'm particularly looking to avoid the case where there are duplicate letters. For example, I know "got" is a match, however "gott" is not a match because there are not two "t" in "catdog".

EDIT:

Based on Adam's response below this is the C# code I got to work in my solution. Thanks to all those that responded.

Note: I had to convert the char to int and subtract 97 to get the appropriate index for the array. In my case the letters are always lower case.

    private bool CompareParts(string a, string b)
    {

        int[] count1 = new int[26];
        int[] count2 = new int[26];

        foreach (var item in a.ToCharArray())
            count1[(int)item - 97]++;

        foreach (var item in b.ToCharArray())
            count2[(int)item - 97]++;

        for (int i = 0; i < count1.Length; i++)
            if(count2[i] > count1[i])
                return false;

        return true;
    }

Upvotes: 4

Views: 1667

Answers (7)

bart
bart

Reputation: 7787

The best way to do it with regular expressions is, IMO:

A. Sort the characters in the large string (search space) Thus: turn "catdog" into "acdgot"

B.

  1. Do the same with the string of which you search the characters of: "gott" becomes, eh, "gott"...

  2. Insert ".*" between each of these characters

  3. Use the latter as the regular expression to search in the former.

For example, some Perl code (if you don't mind):

$main = "catdog"; $search = "gott";
# break into individual characters, sort, and reconcatenate
$main = join '', sort split //, $main;
$regexp = join ".*", sort split //, $search;
print "Debug info: search in '$main' for /$regexp/ \n";
if($main =~ /$regexp/) {
    print "Found a match!\n";
} else {
    print "Sorry, no match...\n";
}

This prints:

Debug info: search in 'acdgot' for /g.*o.*t.*t/
Sorry, no match...

Drop one "t" and you get a match.

Upvotes: 0

BenAlabaster
BenAlabaster

Reputation: 39874

Previous suggestions have already been made that perhaps regex isn't the best way to do this and I agree, however, your accepted answer is a little verbose considering what you're trying to achieve and that is test to see if a set of letters is the subset of another set of letters.

Consider the following code which achieves this in a single line of code:

MatchString.ToList().ForEach(Item => Input.Remove(Item));

Which can be used as follows:

public bool IsSubSetOf(string InputString, string MatchString) 
{
  var InputChars = InputString.ToList(); 
  MatchString.ToList().ForEach(Item => InputChars.Remove(Item)); 
  return InputChars.Count == 0;
}

You can then just call this method to verify if it's a subset or not.

What is interesting here is that "got" will return a list with no items because each item in the match string only appears once, but "gott" will return a list with a single item because there would only be a single call to remove the "t" from the list. Consequently you would have an item left in the list. That is, "gott" is not a subset of "catdog" but "got" is.

You could take it one step further and put the method into a static class:

using System;
using System.Linq;
using System.Runtime.CompilerServices;

static class extensions
{
    public static bool IsSubSetOf(this string InputString, string MatchString)
    {
        var InputChars = InputString.ToList();
        MatchString.ToList().ForEach(Item => InputChars.Remove(Item));
        return InputChars.Count == 0;
    }
}

which makes your method into an extension of the string object which actually makes thins a lot easier in the long run, because you can now make your calls like so:

Console.WriteLine("gott".IsSubSetOf("catdog"));

Upvotes: 3

Adam Rosenfield
Adam Rosenfield

Reputation: 400642

You're using the wrong tool for the job. This is not something regular expressions are capable of handling easily. Fortunately, it's relatively easy to do this without regular expressions. You just count up the number of occurrences of each letter within both strings, and compare the counts between the two strings - if for each letter of the alphabet, the count in the first string is at least as large as the count in the second string, then your criteria are satisfied. Since you didn't specify a language, here's an answer in pseudocode that should be easily translatable into your language:

bool containsParts(string1, string2)
{
    count1 = array of 26 0's
    count2 = array of 26 0's

    // Note: be sure to check for an ignore non-alphabetic characters,
    // and do case conversion if you want to do it case-insensitively
    for each character c in string1:
        count1[c]++
    for each character c in string2:
        count2[c]++

    for each character c in 'a'...'z':
        if count1[c] < count2[c]:
            return false

    return true
}

Upvotes: 7

jfs
jfs

Reputation: 414865

@Adam Rosenfield's solution in Python:

from collections import defaultdict

def count(iterable):
    c = defaultdict(int)
    for hashable in iterable:
        c[hashable] += 1
    return c

def can_spell(word, astring):
    """Whether `word` can be spelled using `astring`'s characters."""

    count_string = count(astring)
    count_word   = count(word)

    return all(count_string[c] >= count_word[c] for c in word)

Upvotes: 0

Alan Moore
Alan Moore

Reputation: 75262

Charlie Martin almost has it right, but you have to do a complete pass for each letter. You can do that with a single regex by using lookaheads for all but the last pass:

/^
 (?=[^got]*g[^got]*$)
 (?=[^got]*o[^got]*$)
 [^got]*t[^got]*
$/x

This makes a nice exercise for honing your regex skills, but if I had to do this in real-life, I wouldn't do it this way. A non-regex approach will require a lot more typing, but any minimally competent programmer will be able to understand and maintain it. If you use a regex, that hypothetical maintainer will also have to be more-than-minimally competent at regexes.

Upvotes: 0

Charlie Martin
Charlie Martin

Reputation: 112414

You want a string that matches exact those letters, exactly once. It depends what you're writing the regex in, but it's going to be something like

^[^got]*(g|o|t)[^got]$

If you've got an operator for "exactly one match" that will help.

Upvotes: 0

derobert
derobert

Reputation: 51197

I don't think there is a sane way to do this with regular expressions. The insane way would be to write out all the permutations:

/^(c?a?t?d?o?g?|c?a?t?d?g?o?| ... )$/

Now, with a little trickery you could do this with a few regexps (example in Perl, untested):

$foo = 'got';
$foo =~ s/c//;
$foo =~ s/a//;
...
$foo =~ s/d//;
# if $foo is now empty, it passes the test.

Sane people would use a loop, of course:

$foo = 'got'
foreach $l (split(//, 'catdog') {
    $foo =~ s/$l//;
}
# if $foo is now empty, it passes the test.

There are much better performing ways to pull this off, of course, but they don't use regexps. And there are no doubt ways to do it if e.g., you can use Perl's extended regexp features like embedded code.

Upvotes: 0

Related Questions