Reputation: 365
I want to implement a sortedMap with keys and values such that the keys can be searched by providing some subsequence. For example, the map contains 3 entries:
abcd -> obj1
def -> obj2
abccd -> obj3
For query ac
, result should be a submap containing the 1st and 3rd entries, but for query acc
, only 3rd entry should be returned.
What kind of data structure should I use internally to efficiently return such submap? For example, the Treemap
which stores the keys in a tree (trie) to efficiently return the submap based on the prefix?
Upvotes: 2
Views: 203
Reputation: 165
/**
* @author viborole
*
*/
public class SortedMapSubsequence {
/**
* @param args
*/
public static void main(String[] args) {
Map<String, String> data = new HashMap<String,String>();
data.put("abcd", "obj1");
data.put("def", "obj2");
data.put("abccd", "obj3");
String query="acc";
search(query, data);
}
private static void search(String query, Map<String, String> data) {
for (Map.Entry<String, String> entry : data.entrySet())
{
String key=entry.getKey();
char[] k=key.toCharArray();
char[] q=query.toCharArray();
int kpos=0;
char[] found = new char[q.length];
for(int i=0;i<q.length;i++){
if(i>0){
kpos++;
}
for(int j=kpos;j<k.length;j++){
if(k[j]==q[i]){
kpos=j;
found[i]=k[j];
break;
}
}
}
if(Arrays.equals(found,q)){
System.out.println("found : "+ entry.getValue());
}
}
}
}
Upvotes: 0
Reputation: 12592
You can try an aho-corasick with an wildcard. Aho-Corasick is the faster multiple pattern matching algorithm and with a wildcard it gives all substrings. You can try my php implementation at codeplex (https://phpahocorasick.codeplex.com). For example an unittest with wildcard for gene pattern:
$tree = new Ahocorasick\Ahocorasick();
$tree->add("AC");
$tree->add("GTG");
$tree->add("AACT");
echo $tree->match("ACCGAGTGCGTGGACAAACTACGATTGTGGAATGAACT","AC*GT");
$this->expectOutputString("ACCGAGT,ACCGAGTGCGT,ACCGAGTGCGTGGACAAACTACGATTGT,ACAAACTACGATTGT,ACTACGATTGT,ACGATTGT");
Upvotes: 0
Reputation: 17945
Your "characters" can be interpreted as terms in a traditional search-engine AND-query.
From this comment:
all the subsequence of abcd, such as a,b,c,d,ab, ac, ad, bc, bd,cd , abc, abd, acd, abcd etc, when queried, should return a submap with abcd -> obj1 as one of its entry.
We could interpret that, with a document with terms Aspid Beaver Cherokee Deer (ABCD), the document should be returned when searching for either of its words, or any of their combinations.
What you want to build for this is an inverted index. See this answer for more details. Essentially, you would build a HashMap
or lookup table (there are not that many chars) for each "character" (=term in search-engine-speak), where a lookup would return all objX
where it appears (=document in search-engine-speak), ideally in ascending order.
To find all objects associated with a set of characters, you would join the sets for each individual character. Since they are ordered, you can calculate set-intersection in linear time.
If querying for ba
should not match an abc
key, you can refine the lookup-table / HashMap
to store character positions (eg.: store "b is found in Obj1 at position 0", "a is found in Obj1 at position 1", instead of discarding positions). When calculating search-intersection, you would proceed in search order and discard matches that have an incorrect relative order.
This is routine for search engines, and has been analyzed exhaustively in terms of performance.
Edit: Performance
If the number of distinct "terms" is low (for instance: the 26 lowercase English letters) you can speed this up by adding n-grams as terms (where, for "abc", you would add "a", "b", "c", "ab", "ac", "bc", and "abc"). The same rules would apply - with half the searches, and since overlap would be expected, Sasha's trick could be used to avoid storing indices.
Upvotes: 0
Reputation: 14348
To summarize what I've written in comments, I'd make it as following:
HashMap
with entries [2-char subsequence -> list of words].If a query consists of one character, then perform full scan. I believe this would have better space/complexity than creating additional index for single chars, especially when 1-char queries are rare.
Upvotes: 1
Reputation: 2078
Note: My solution works in case of looking for substring. Read edit.
Solution:
Use prefix data structure: http://en.wikipedia.org/wiki/Trie
You will need to store:
abcd -> obj1
as:
abcd -> obj1
bcd -> obj1
cd -> obj1
d -> obj1
To find your result, you would then go deep enough in the trie to satisfy condition and DFS from there to find all possible solutions.
Examples:
ab -> obj1
cde -> obj1
bad -> obj2
We would insert next entries in our Trie:
ab -> obj1
b -> obj1
cde -> obj1
de -> obj1
e -> obj1
bad -> obj2
ad -> obj2
d -> obj2
Now imagine we look for next entries:
d
First move down for character d in Trie. After that we would DFS. We are already at obj2 and if we move for character e for at obj1. So we can get both of them with entry d.
cd
First we move for character c and then for character d. Now we DFS and only path we can find is by adding character e which leads to obj1.
efg
There is not path in Trie which satisfies out condition, therefore we get no solutions.
EDIT:
My solution works if we are looking for a substring of original element. To modify my solution for it to work for any non-contagious substring you would have to insert all the permutation aswell in the Trie.
Upvotes: 0