Reputation: 147
Forgive me, this is my first ever post to SO, so do let me know how I can improve.
I am currently looking for advice on a problem I am facing. I have a list of one billion unique strings of text. These text strings also have a list of tags associated with them to indicate the content of the string.
Example:
StringText: The cat ate on Sunday
AnimalCode: c001
ActionCode: a001
TimeCode: d001
where
c001 = The cat
a001 = ate
d001 = on Sunday
I have loaded all of the strings and their codes as individual documents in an instance of MongoDB.
At present, I am trying to devise a method by which I can enter a string and search against the database to return the match. My problem is that the search is taking far too long to return results.
I have created an index on the StringText field, but I am guessing that it is too large to hold in memory.
Each string has an equal probability of being searched for so I can't reliably predict which strings have a higher probability of being searched for and pull them out into another collection.
Currently, I am running the DB off a single box with 16GB of RAM and a 4TB HDD.
Does anybody have any advice on how I might accomplish my task more efficiently? Is Mongo the right technology, or are there others better suited to this kind of search and return?
My goal (forgive me if foolish) would be to return a result in 2 seconds or less.
I am very new to this whole arena so any and all advice would be welcome.
Thanks much to all in advance for the help and time.
Sincerely, Zinga
Upvotes: 0
Views: 42
Reputation: 1935
As discussed in the comments, you could preprocess the input string to find the associated Animal and Action codes and search for StringText based on the indexed codes, which is much faster than text search.
You can't totally avoid text search, so reduce it to the Animal and/or Action collection by tokenizing the input string. See how you can use map/reduce techniques just for queries of this sort.
In your case, if you know that the first word or two will always contain the name of the animal, just use those one or two words to search for the relevant animal. Searching through the Animal/Action collections shouldn't take long. If it does, you can keep a periodically updated list of the most common animals/actions (based on their frequency) and search against that first to make it faster. This is also discussed in the articles on the linked page.
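To make the tokenizing step concrete, here is a minimal sketch in Python. The three lookup tables, the helper names `match_prefix` and `build_filter`, and the extra sample entries are all made up for illustration; in practice the tables would be small indexed collections rather than in-memory dicts. It assumes the phrases appear in the order animal, action, time, as in your example.

```python
# Hypothetical lookup tables standing in for small Animal/Action/Time collections.
ANIMALS = {"the cat": "c001", "the dog": "c002"}
ACTIONS = {"ate": "a001", "slept": "a002"}
TIMES = {"on sunday": "d001", "on monday": "d002"}


def match_prefix(tokens, table, max_words=3):
    """Return (code, words_used) for the longest known phrase at the start of tokens."""
    for n in range(min(max_words, len(tokens)), 0, -1):
        phrase = " ".join(tokens[:n])
        if phrase in table:
            return table[phrase], n
    return None, 0


def build_filter(text):
    """Turn an input string into a filter on the code fields plus the exact text."""
    tokens = text.lower().split()
    query = {}
    for field, table in (("AnimalCode", ANIMALS),
                         ("ActionCode", ACTIONS),
                         ("TimeCode", TIMES)):
        code, used = match_prefix(tokens, table)
        if code is not None:
            query[field] = code
            tokens = tokens[used:]  # continue matching after the consumed words
    # Keep the exact text so only the true match is returned.
    query["StringText"] = text
    return query


print(build_filter("The cat ate on Sunday"))
```

The resulting dict has the shape of a MongoDB filter document, so with an index on the code fields the database only has to compare StringText within the small group of documents that share those codes. If no code is recognised, the filter falls back to the plain StringText match.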
If your search against StringText is still slow after that, you could shard the StringText collection by Animal/Action codes. The official documentation should suffice for this, and there isn't much involved in the setup, so you might try it anyway. The basic idea everywhere is to restrict your target space as much as possible. Searching through a billion records for every query is plain overkill. Cache where you can, preprocess where you can, and show guesses while you run a slow query.
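As a rough illustration of why sharding by the codes narrows the target space, here is a toy sketch in Python. It is not how you configure MongoDB (the database does its own routing once you choose a shard key); the shard count, the `shard_for`, `insert` and `find` helpers, and the in-memory dicts are assumptions made up for the example.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(animal_code, action_code, num_shards=NUM_SHARDS):
    """Map a pair of codes to a shard number using a stable hash."""
    key = f"{animal_code}:{action_code}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each "shard" is a dict keyed by StringText, standing in for a separate server.
shards = [dict() for _ in range(NUM_SHARDS)]


def insert(doc):
    """Store a document in the shard chosen by its Animal/Action codes."""
    shard = shards[shard_for(doc["AnimalCode"], doc["ActionCode"])]
    shard[doc["StringText"]] = doc


def find(text, animal_code, action_code):
    """Look in exactly one shard instead of scanning all of them."""
    shard = shards[shard_for(animal_code, action_code)]
    return shard.get(text)


insert({"StringText": "The cat ate on Sunday",
        "AnimalCode": "c001", "ActionCode": "a001", "TimeCode": "d001"})
print(find("The cat ate on Sunday", "c001", "a001"))
```

Because the codes are known before the lookup, each query touches one shard only, and each shard's index is a fraction of the full one, which makes it more likely to fit in RAM.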
Good luck!
Upvotes: 1