user2436032
user2436032

Reputation: 365

Implement a sortedMap with subsequence matching

I want to implement a sortedMap with keys and values such that the keys can be searched by providing some subsequence. For example, the map contains 3 entries:

abcd -> obj1
def -> obj2
abccd -> obj3

For query ac, result should be a submap containing the 1st and 3rd entries, but for query acc, only 3rd entry should be returned.

What kind of data structure should I use internally to efficiently return such submap? For example, the Treemap which stores the keys in a tree (trie) to efficiently return the submap based on the prefix?

Upvotes: 2

Views: 203

Answers (5)

Vinodborole
Vinodborole

Reputation: 165

/**
 * @author viborole
 *
 */
public class SortedMapSubsequence {

/**
 * @param args
 */
public static void main(String[] args) {
    Map<String, String> data = new HashMap<String,String>();
    data.put("abcd", "obj1");
    data.put("def", "obj2");
    data.put("abccd", "obj3");
    String query="acc";
    search(query, data);
}

private static void search(String query, Map<String, String> data) {
    for (Map.Entry<String, String> entry : data.entrySet())
    {
        String key=entry.getKey();
        char[] k=key.toCharArray();
        char[] q=query.toCharArray();
        int kpos=0;
        char[] found = new char[q.length];
        for(int i=0;i<q.length;i++){
            if(i>0){
                kpos++;
            }
            for(int j=kpos;j<k.length;j++){
                if(k[j]==q[i]){
                    kpos=j;
                    found[i]=k[j];
                    break;
                } 
            }
        }
        if(Arrays.equals(found,q)){
            System.out.println("found : "+ entry.getValue());
        }
    }
}

}

Upvotes: 0

Cybercartel
Cybercartel

Reputation: 12592

You can try an aho-corasick with an wildcard. Aho-Corasick is the faster multiple pattern matching algorithm and with a wildcard it gives all substrings. You can try my php implementation at codeplex (https://phpahocorasick.codeplex.com). For example an unittest with wildcard for gene pattern:

$tree = new Ahocorasick\Ahocorasick();
$tree->add("AC");
$tree->add("GTG");
$tree->add("AACT");
echo $tree->match("ACCGAGTGCGTGGACAAACTACGATTGTGGAATGAACT","AC*GT");
$this->expectOutputString("ACCGAGT,ACCGAGTGCGT,ACCGAGTGCGTGGACAAACTACGATTGT,ACAAACTACGATTGT,ACTACGATTGT,ACGATTGT");

Upvotes: 0

tucuxi
tucuxi

Reputation: 17945

Your "characters" can be interpreted as terms in a traditional search-engine AND-query.

From this comment:

all the subsequence of abcd, such as a,b,c,d,ab, ac, ad, bc, bd,cd , abc, abd, acd, abcd etc, when queried, should return a submap with abcd -> obj1 as one of its entry.

We could interpret that, with a document with terms Aspid Beaver Cherokee Deer (ABCD), the document should be returned when searching for either of its words, or any of their combinations.

What you want to build for this is an inverted index. See this answer for more details. Essentially, you would build a HashMap or lookup table (there are not that many chars) for each "character" (=term in search-engine-speak), where a lookup would return all objX where it appears (=document in search-engine-speak), ideally in ascending order.

To find all objects associated with a set of characters, you would join the sets for each individual character. Since they are ordered, you can calculate set-intersection in linear time.

If querying for ba should not match an abc key, you can refine the lookup-table / HashMap to store character positions (eg.: store "b is found in Obj1 at position 0", "a is found in Obj1 at position 1", instead of discarding positions). When calculating search-intersection, you would proceed in search order and discard matches that have an incorrect relative order.

This is routine for search engines, and has been analyzed exhaustively in terms of performance.


Edit: Performance

If the number of distinct "terms" is low (for instance: the 26 lowercase English letters) you can speed this up by adding n-grams as terms (where, for "abc", you would add "a", "b", "c", "ab", "ac", "bc", and "abc"). The same rules would apply - with half the searches, and since overlap would be expected, Sasha's trick could be used to avoid storing indices.

Upvotes: 0

Alex Salauyou
Alex Salauyou

Reputation: 14348

To summarize what I've written in comments, I'd make it as following:

  1. Create additional index HashMap with entries [2-char subsequence -> list of words].
  2. On addition: generate all distinct 2-char subsequences from given word, add this word to every corresponding entry of index map.
  3. On querying: generate all distinct 2-char subsequences from a query, among them find one which corresponds to the shortest list in index map, take this list. Filter it by full query and collect corresponding values from the main map.

If a query consists of one character, then perform full scan. I believe this would have better space/complexity than creating additional index for single chars, especially when 1-char queries are rare.

Upvotes: 1

Neithrik
Neithrik

Reputation: 2078

Note: My solution works in case of looking for substring. Read edit.

Solution:

Use prefix data structure: http://en.wikipedia.org/wiki/Trie

You will need to store:

abcd -> obj1

as:

abcd -> obj1
bcd -> obj1
cd -> obj1
d -> obj1

To find your result, you would then go deep enough in the trie to satisfy condition and DFS from there to find all possible solutions.

Examples:

ab -> obj1
cde -> obj1
bad -> obj2

We would insert next entries in our Trie:

ab -> obj1
b -> obj1
cde -> obj1
de -> obj1
e -> obj1
bad -> obj2
ad -> obj2
d -> obj2

Now imagine we look for next entries:

d

First move down for character d in Trie. After that we would DFS. We are already at obj2 and if we move for character e for at obj1. So we can get both of them with entry d.

cd

First we move for character c and then for character d. Now we DFS and only path we can find is by adding character e which leads to obj1.

efg

There is not path in Trie which satisfies out condition, therefore we get no solutions.

EDIT:

My solution works if we are looking for a substring of original element. To modify my solution for it to work for any non-contagious substring you would have to insert all the permutation aswell in the Trie.

Upvotes: 0

Related Questions