Michael
Michael

Reputation: 42050

How to find all permutations of a given word in a given text?

This is an interview question (phone screen): write a function (in Java) to find all permutations of a given word that appear in a given text. For example, for word abc and text abcxyaxbcayxycab the function should return abc, bca, cab.

I would answer this question as follows:

How would you answer this question?

Upvotes: 10

Views: 7254

Answers (6)

Kunukn
Kunukn

Reputation: 2236

To find a permutation of a string you can use number theory. But you will have to know the 'theory' behind this algorithm in advance before you can answer the question using this algorithm.

There is a method where you can calculate a hash of a string using prime numbers. Every permutation of the same string will give the same hash value. All other string combination which is not a permutation will give some other hash value.

The hash-value is calculated by c1 * p1 + c2 * p2 + ... + cn * pn where ci is a unique value for the current char in the string and where pi is a unique prime number value for the ci char.

Here is the implementation.

public class Main {
    static int[] primes = new int[] { 2, 3, 5, 7, 11, 13, 17, 
        19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 
        73, 79, 83, 89, 97, 101, 103 };

    public static void main(String[] args) {        
        final char[] text = "abcxaaabbbccyaxbcayaaaxycab"
            .toCharArray();     
        char[] abc = new char[]{'a','b','c'};       
        int match = val(abc);                   
        for (int i = 0; i < text.length - 2; i++) {
            char[] _123 = new char[]{text[i],text[i+1],text[i+2]};          
            if(val(_123)==match){
                System.out.println(new String(_123) );      
            }
        }
    }   
    static int p(char c) {
        return primes[(int)c - (int)'a'];
    }   
    static int val(char[] cs) {
        return 
        p(cs[0])*(int)cs[0] + p(cs[1])*(int)cs[1] + p(cs[2])*(int)cs[2];        
    }
}

The output of this is: abc bca cab

Upvotes: 5

Vitaly Olegovitch
Vitaly Olegovitch

Reputation: 3547

This is probably not the most efficient solution algorithmically, but it is clean from a class design point of view. This solution takes the approach of comparing "sorted" given words.

We can say that a word is a permutation of another if it contains the same letters in the same number. This means that you can convert the word from a String to a Map<Character,Integer>. Such conversion will have complexity O(n) where n is the length of the String, assuming that insertions in your Map implementation cost O(1).

The Map will contain as keys all the characters found in the word and as values the frequencies of the characters.

Example. abbc is converted to [a->1, b->2, c->1]

bacb is converted to [a->1, b->2, c->1]

So if you have to know if two words are one the permutation of the other, you can convert them both into maps and then invoke Map.equals.

Then you have to iterate over the text string and apply the transformation to all the substrings of the same length of the words that you are looking for.

Improvement proposed by Inerdial

This approach can be improved by updating the Map in a "rolling" fashion.

I.e. if you're matching at index i=3 in the example haystack in the OP (the substring xya), the map will be [a->1, x->1, y->1]. When advancing in the haystack, decrement the character count for haystack[i], and increment the count for haystack[i+needle.length()].

(Dropping zeroes to make sure Map.equals() works, or just implementing a custom comparison.)

Improvement proposed by Max

What if we also introduce matchedCharactersCnt variable? At the beginning of the haystack it will be 0. Every time you change your map towards the desired value - you increment the variable. Every time you change it away from the desired value - you decrement the variable. Each iteration you check if the variable is equal to the length of needle. If it is - you've found a match. It would be faster than comparing the full map every time.

Pseudocode provided by Max:

needle = "abbc"
text = "abbcbbabbcaabbca"

needleSize = needle.length()
//Map of needle character counts
targetMap = [a->1, b->2, c->1]

matchedLength = 0
curMap = [a->0, b->0, c->0]
//Initial map initialization
for (int i=0;i<needle.length();i++) {
    if (curMap.contains(haystack[i])) {
        matchedLength++
        curMap[haystack[i]]++
    }
}

if (matchedLength == needleSize) {
    System.out.println("Match found at: 0");
}

//Search itself
for (int i=0;i<haystack.length()-needle.length();i++) {
    int targetValue1 = targetMap[haystack[i]]; //Reading from hashmap, O(1)
    int curValue1 = curMap[haystack[i]]; //Another read
    //If we are removing beneficial character
    if (targetValue1 > 0 && curValue1 > 0 && curValue1 <= targetValue1) {       
        matchedLength--;
    }
    curMap[haystack[i]] = curValue1 + 1; //Write to hashmap, O(1)


    int targetValue2 = targetMap[haystack[i+needle.length()]] //Read
    int curValue2 = curMap[haystack[i+needle.length()]] //Read
    //We are adding a beneficial character
    if (targetValue2 > 0 && curValue2 < targetValue2) { //If we don't need this letter at all, the amount of matched letters decreases
        matchedLength++;
    }
    curMap[haystack[i+needle.length()]] = curValue2 + 1; //Write

    if (matchedLength == needleSize) {
        System.out.println("Match found at: "+(i+1));
    }
}

//Basically with 4 reads and 2 writes which are 
//independent of the size of the needle,
//we get to the maximal possible performance: O(n)

Upvotes: 12

Andrea Parodi
Andrea Parodi

Reputation: 5614

This code should do the work:

import java.util.ArrayList;
import java.util.List;

public class Permutations {
    public static void main(String[] args) {
        final String word = "abc";
        final String text = "abcxaaabbbccyaxbcayxycab";
        List<Character> charsActuallyFound = new ArrayList<Character>();
        StringBuilder match = new StringBuilder(3);

        for (Character c : text.toCharArray()) {
            if (word.contains(c.toString()) && !charsActuallyFound.contains(c)) {
                charsActuallyFound.add(c);
                match.append(c);
                if (match.length()==word.length())
                {
                    System.out.println(match);
                    match = new StringBuilder(3);
                    charsActuallyFound.clear();
                }
            } else {
                match = new StringBuilder(3);
                charsActuallyFound.clear();
            }
        }
    }
}

The charsActuallyFound List is used to keep track of character already found in the loop. It is needed to avoid mathing "aaa" "bbb" "ccc" (added by me to the text you specified).

After further reflection, I think my code only work if the given word has no duplicate characters. The code above correctly print

abc
bca
cab

but if you seaarch for the word "aaa", then nothing is printed, because each char can not be matched more than one time. Inspired from Jim Mischel answer, I edit my code, ending with this:

import java.util.ArrayList;
import java.util.List;

public class Permutations {
    public static void main(String[] args) {
        final String text = "abcxaaabbbccyaxbcayaaaxycab";

        printMatches("aaa", text);
        printMatches("abc", text);
    }

    private static void printMatches(String word, String text) {
        System.out.println("matches for "+word +" in "+text+":");

        StringBuilder match = new StringBuilder(3);
        StringBuilder notYetFounds=new StringBuilder(word);

        for (Character c : text.toCharArray()) {
            int idx = notYetFounds.indexOf(c.toString());
            if (idx!=-1) {
               notYetFounds.replace(idx,idx+1,"");

                match.append(c);
                if (match.length()==word.length())
                {
                    System.out.println(match);
                    match = new StringBuilder(3);
                    notYetFounds=new StringBuilder(word);
                }
            } else {
                match = new StringBuilder(3);
                notYetFounds=new StringBuilder(word);
            }
        }
        System.out.println();
    }

}

This give me following output:

matches for aaa in abcxaaabbbccyaxbcayaaaxycab:
aaa
aaa

matches for abc in abcxaaabbbccyaxbcayaaaxycab:
abc
bca
cab

Did some benchmark, the code above found 30815 matches of "abc" in a random string of 36M in just 4,5 seconds. As Jim already said, thanks for this puzzle...

Upvotes: 1

smocking
smocking

Reputation: 3709

The second approach seems very elegant to me and should be perfectly acceptable. I think it scales at O(M * N log N), where N is word length and M is text length.

I can come up with a somewhat more complex O(M) algorithm:

  1. Count the occurrence of each character in the word
  2. Do the same for the first N (i.e. length(word)) characters of the text
  3. Subtract the two frequency vectors, yielding subFreq
  4. Count the number of non-zeroes in subFreq, yielding numDiff
  5. If numDiff equals zero, there is a match
  6. Update subFreq and numDiff in constant time by updating for the first and after-last character in the text
  7. Go to 5 until reaching the end of the text

EDIT: See that several similar answers have been posted. Most of this algorithm is equivalent to the rolling frequency counting suggested by others. My humble addition is also updating the number of differences in a rolling fashion, yielding an O(M+N) algorithm rather than an O(M*N) one.

EDIT2: Just saw that Max has basically suggested this in the comments, so brownie points to him.

Upvotes: 1

Jim Mischel
Jim Mischel

Reputation: 133995

You should be able to do this in a single pass. Start by building a map that contains all the characters in the word you're searching for. So initially the map contains [a, b, c].

Now, go through the text one character at a time. The loop looks something like this, in pseudo-code.

found_string = "";
for each character in text
    if character is in map
        remove character from map
        append character to found_string
        if map is empty
            output found_string
            found_string = ""
            add all characters back to map
        end if
    else
        // not a permutation of the string you're searching for
        refresh map with characters from found_string
        found_string = ""
    end if
end for

If you want unique occurrences, change the output step so that it adds the found strings to a map. That'll eliminate duplicates.

There's the issue of words that contain duplicated letters. If that's a problem, make the key the letter and the value a count. 'Removing' a character means decrementing its count in the map. If the count goes to 0, then the character is in effect removed from the map.

The algorithm as written won't find overlapping occurrences. That is, given the text abcba, it will only find abc. If you want to handle overlapping occurrences, you can modify the algorithm so that when it finds a match, it decrements the index by one minus the length of the found string.

That was a fun puzzle. Thanks.

Upvotes: 3

A B
A B

Reputation: 4148

This is what I would do - set up a flag array with one element equal to 0 or 1 to indicate whether that character in STR had been matched

Set the first result string RESULT to empty.

for each character C in TEXT:

Set an array X equal to the length of STR to all zeroes.

for each character S in STR: If C is the JTH character in STR, and X[J] == 0, then set X[J] <= 1 and add C to RESULT. If the length of RESULT is equal to STR, add RESULT to a list of permutations and set the elements of X[] to zeroes again.

If C is not any character J in STR having X[J]==0, then set the elements of X[] to zeroes again.

Upvotes: 1

Related Questions