Reputation: 6390
We have many documents that consists of words. What is most appropriate way to index the documents. A search query should support the AND/OR operations.
The query runtime should be efficient as possible. Please describe the space required for the index.
The documents contain words only(excluding AND/OR) and the query contains words and the keywords AND/OR.
EDIT: what would be the algorithm if I would allow only 2 keywords and an operation (e.g. w1 AND w2)
Upvotes: 3
Views: 90
Reputation: 8898
The basic data structured needed is an inverted index. This maps each word to the set of documents that contain it. Lets say lookup
is a function from words to document sets: lookup: W -> Pos(D)
(where W
is the set of words, D
the set of documents, Pos(D)
the power set of D
).
Lets say you have an query expression of the form:
Expr ::= Word | AndExpression | OrExpression
AndExpression ::= Expr 'AND' Expr
OrExpression ::= Expr 'OR' Expr
So you get an abstract syntax tree representing your query. That's a tree with the following kind of nodes:
abstract class Expression { }
class Word extends Expression {
String word
}
class AndExpression extends Expression {
Expression left
Expression right
}
class OrExpression extends Expression {
Expression left
Expression right
}
For example, foo AND (bar OR baz)
would be translated to this tree:
AndExpression
/ \
/ \
Word('foo') OrExpression
/ \
/ \
Word('bar') Word('baz')
To evaluate this tree, follow these simple rules, expressed in pseudocode:
Set<Document> evaluate(Expr e) {
if (e is Word)
return lookup(e.word)
else if (e is AndExpression)
return intersection(evaluate(e.left), evaluate(e.right))
else if (e is OrExpression)
return union(evaluate(e.left), evaluate(e.right))
//otherwise, throw assertion error: no other case remaining
}
//implemented by the inverted index, not shown
Set<Document> lookup(String word)
Thus, AND
expressions are basically translated to set intersections, while OR
expressions are translated to union expressions, all evaluated recursively. I'm sure if you stare long enough on the above, you'll see its beauty :)
You could represent each set (that lookup
returns) as a HashSet. If you use Java, You could also use guava's lazy union and intersection implementations, that should be fun (especially if you study the code or use your imagination to see what 'lazy' really means in this context).
To the best of my knowledge, though, intersections are rarely computed by intersecting hashtables - instead, what usually happens is the following: assume are 3 sets to be intersected, we pick one (the smallest preferably) and assign a counter (equal to 1) to each of the documents. We then iterate the other sets, incrementing the counter of each found document. Finally, we report each document that its counter becomes 3 (that means that the document appeared in all sets, thus exists in their intersection).
Upvotes: 1
Reputation: 6390
In case only 2 keywords are allowed (e.g. "key1 AND key2")
This is the solution I found so far: 1) keyMap, HashMap, where the key is the keywords and the value is the LinkedList of documents. 2) docMap, HashMap, where the key is the document id and the value is an HashSet of keywords
Now on such a query ("key1 AND key2") I would:
LinkedList docs = keyMap.get(key1);
for each (HashSet doc:docs)
if(doc.contains(keys))
result.add(doc);
return result
What do you think? Is there any better way? How about 3 keywords?
Upvotes: 1
Reputation: 3585
As @Wrikken said, use lucene.
As you are interested in the algorithms used there are many, you can find a starting point here and more information here.
Upvotes: 0
Reputation: 70500
With many documents, a dedicated package like Lucene becomes very attractive.
Upvotes: 0
Reputation: 78344
Google Desktop springs to mind, but all the major O/S have similar features built in or easily added. I'd describe the space used by Spotlight on my Mac as 'reasonable'.
Upvotes: 0