Reputation: 3913
I have a problem that I think a NoSQL solution can answer, but I am not sure. I am also not sure which type of NoSQL DB (object, document, graph, key-value, etc.) would be best suited to solve it.
Problem:
I have two collections. CollectionA contains 2K+ strings (domain names). CollectionB is much larger and looks (pseudo) like this:
{
"To" : "[email protected],[email protected],there_could_be_100@more_address.com",
"Bcc" : "[email protected],[email protected],there_could_be_100@more_address.com",
"From" : "[email protected],[email protected],there_could_be_100@more_address.com",
"Subject" : "Email Subject",
"Unknown" : "NumberOfFields",
"N" : "PlusOneExtraFields"
}
Knowns:
Current line of thinking:
Using a NoSQL solution and maybe the C# dynamic keyword?
Fuzzy
Is this a problem that is easily solved by a document database?
Is searching/comparing across this type of data structure something that calls for Map/Reduce?
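Independent of the storage choice, the core comparison (which of the 2K+ CollectionA domains appear in a message's address fields) can be sketched as a simple set lookup. Here is a minimal Python sketch of the idea; the domain set, the `message` dict, and all addresses are invented placeholders, not your real data:

```python
# Hypothetical stand-ins: KNOWN_DOMAINS plays CollectionA,
# `message` plays one CollectionB document.
KNOWN_DOMAINS = {"example.com", "domain.com", "more_address.com"}

message = {
    "To": "user@example.com,user@domain.com,someone@elsewhere.org",
    "Bcc": "user@example.com",
    "From": "sender@more_address.com",
}

def matched_domains(message, known_domains):
    """Return the known domains that appear in any address field."""
    found = set()
    for field in ("To", "Bcc", "From"):
        for address in message.get(field, "").split(","):
            # Take everything after the last "@" as the domain.
            domain = address.rpartition("@")[2].strip().lower()
            if domain in known_domains:
                found.add(domain)
    return found

print(sorted(matched_domains(message, KNOWN_DOMAINS)))
# ['domain.com', 'example.com', 'more_address.com']
```

The open question in the answers below is where this lookup should run: in application code, in a full-text index, or as a Map/Reduce job over the whole collection.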
Upvotes: 2
Views: 436
Reputation: 63548
No, it is not. It is a candidate for a full-text search engine, which has nothing to do with "nosql", whatever that is.
Full-text search engines often use SQL or some variant of it; Sphinx and Lucene are examples. You could also use Microsoft's own, but I don't know whether that will satisfy your requirements; you need to check.
Upvotes: 0
Reputation: 65877
I feel this is a good candidate for Apache Lucene.Net.
You can create a Lucene document for the structure specified above like this:
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
doc.Add(new Lucene.Net.Documents.Field(
    "To",
    ToData,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.ANALYZED,
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.Add(new Lucene.Net.Documents.Field(
    "From",
    FromData,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.ANALYZED,
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.Add(new Lucene.Net.Documents.Field(
    "BCC",
    BCCData,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.ANALYZED,
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
// Since you don't want the Unknown field to be indexed, you can make it Index.NO.
doc.Add(new Lucene.Net.Documents.Field(
    "Unknown",
    UnknownData,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.NO));
But the problem with Lucene is that you cannot add a new field to, or modify the field structure of, an existing document at a later time. You would have to delete the documents and create new ones from scratch.
A better approach would be to make all of your fields, including the unknown ones, indexable.
Upvotes: 0
Reputation: 5082
I totally agree with @HighTechRider: denormalization of the data ("exploding" it, as you put it) seems a must in this instance for performant queries, if the volume of data is as large as you imply. Otherwise, no matter what product you pick, queries will end up as a free-text scan of some fashion or other.
@chx's suggestion of Sphinx seems plausible, at least for accelerating the latter. But there are hidden costs to that route: you would need to bundle, install, manage, patch, update, etc. someone else's service alongside your software.
Minimizing desktop resource consumption during indexing and querying has to be a high priority, and setting up a free-text search server on a desktop seems somewhat contrary to that charter.
I'd start with the basic file system, using filesystem objects to represent your denormalized data. Or, if representing and executing your queries that way seems too complex, look at simple embedded table libraries like SQLite or SQL Server Compact Edition before trying to shoehorn more exotic server-targeted products onto the desktop.
Nice comparison of SQLite vs. SQL Compact Edition here:
SQLite can also create full-text indexes that would cover some of your "unknown field" scenarios in the future.
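To make the SQLite full-text route concrete, here is a minimal sketch using Python's built-in sqlite3 module and an FTS5 virtual table. The table and column names are invented for illustration, and it assumes your SQLite build includes the FTS5 extension (most modern builds do):

```python
import sqlite3

# Illustrative schema: one full-text indexed column per known email field.
con = sqlite3.connect(":memory:")
con.execute('CREATE VIRTUAL TABLE messages USING fts5("To", "From", "Bcc", "Subject")')
con.execute(
    'INSERT INTO messages ("To", "From", "Bcc", "Subject") VALUES (?, ?, ?, ?)',
    ("user@example.com,user@domain.com", "sender@more_address.com", "", "Email Subject"),
)

# The default tokenizer splits addresses at "@" and ".", so a quoted phrase
# query matches a domain wherever it appears in any indexed field.
rows = con.execute(
    "SELECT Subject FROM messages WHERE messages MATCH ?", ('"example.com"',)
).fetchall()
print(rows)
```

In practice you would loop the MATCH query over (or OR together) your 2K+ known domains, which the index makes far cheaper than scanning every message.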
As for map-reduce, its strategy is valid for the domain you're approaching.
Upvotes: 1
Reputation: 11760
Store the data in XML and search it with Sphinx. Use xmlpipe2 to feed Sphinx, through something like grep, so that only the known fields go into it. Once you need to search on more, add those fields to your filter and the schema, then reindex. Sphinx indexes at such speeds that this poses no real problem, and it can be distributed too.
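For reference, an xmlpipe2 stream for this kind of document looks roughly like the following; the field names are illustrative, and the real schema would list whichever known fields your filter lets through:

```xml
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="to"/>
    <sphinx:field name="from"/>
    <sphinx:field name="bcc"/>
    <sphinx:field name="subject"/>
  </sphinx:schema>
  <sphinx:document id="1">
    <to>user@example.com,user@domain.com</to>
    <from>sender@more_address.com</from>
    <bcc></bcc>
    <subject>Email Subject</subject>
  </sphinx:document>
</sphinx:docset>
```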
You are calling for text search, and that means Solr or Sphinx; between the two, Sphinx is way easier to set up on a Windows desktop.
Upvotes: 0