Reputation: 3913
I have a problem that I think a NoSQL solution can answer, but I am not sure. I am also not sure which type of NoSQL DB (object, document, graph, key-value, etc.) would be best suited to solve it.
Problem:
I have two collections. CollectionA contains 2K+ strings (domain names). CollectionB is much larger and looks (pseudo) like this:
{
"To" : "[email protected],[email protected],there_could_be_100@more_address.com",
"Bcc" : "[email protected],[email protected],there_could_be_100@more_address.com",
"From" : "[email protected],[email protected],there_could_be_100@more_address.com",
"Subject" : "Email Subject",
"Unknown" : "NumberOfFields",
"N" : "PlusOneExtraFields"
}
Knowns:
Current line of thinking:
Using a NoSQL solution and maybe the C# dynamic keyword?
Fuzzy
Is this a problem that is easily solved by a document database?
Is searching/comparing across this type of data structure something that calls for Map/Reduce?
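Independent of the storage choice, the core comparison (which of the 2K+ CollectionA domains appear in a message's address fields) can be sketched as a simple set lookup. Here is a minimal Python sketch of the idea; the domain set, the `message` dict, and all addresses are invented placeholders, not your real data:

```python
# Hypothetical stand-ins: KNOWN_DOMAINS plays CollectionA,
# `message` plays one CollectionB document.
KNOWN_DOMAINS = {"example.com", "domain.com", "more_address.com"}

message = {
    "To": "user@example.com,user@domain.com,someone@elsewhere.org",
    "Bcc": "user@example.com",
    "From": "sender@more_address.com",
}

def matched_domains(message, known_domains):
    """Return the known domains that appear in any address field."""
    found = set()
    for field in ("To", "Bcc", "From"):
        for address in message.get(field, "").split(","):
            # Take everything after the last "@" as the domain.
            domain = address.rpartition("@")[2].strip().lower()
            if domain in known_domains:
                found.add(domain)
    return found

print(sorted(matched_domains(message, KNOWN_DOMAINS)))
# ['domain.com', 'example.com', 'more_address.com']
```

The open question in the answers below is where this lookup should run: in application code, in a full-text index, or as a Map/Reduce job over the whole collection.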
Upvotes: 2
Views: 436
Reputation: 63548
No, it is not. It is a candidate for a full-text search engine, which has nothing to do with "nosql", whatever that is.
Full-text search engines often use SQL or some variant of it; Sphinx and Lucene are examples. You could also use Microsoft's own, but I don't know whether that will satisfy your requirements; you need to check.
Upvotes: 0
Reputation: 65877
I feel this is a good candidate for Apache Lucene.Net.
You can create a Lucene document for the structure specified above like this:
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
doc.Add(new Lucene.Net.Documents.Field(
    "To",
    ToData,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.ANALYZED,
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.Add(new Lucene.Net.Documents.Field(
    "From",
    FromData,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.ANALYZED,
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
doc.Add(new Lucene.Net.Documents.Field(
    "BCC",
    BCCData,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.ANALYZED,
    Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
// Since you don't want the Unknown field to be indexed, you can make it Index.NO.
doc.Add(new Lucene.Net.Documents.Field(
    "Unknown",
    UnknownData,
    Lucene.Net.Documents.Field.Store.YES,
    Lucene.Net.Documents.Field.Index.NO));
But the problem with Lucene is that you cannot add a new field to, or modify the field structure of, an existing document at a later time. You would have to delete the documents and create new ones from scratch.
A better approach would be to make all of your fields, including the unknown ones, indexable.
Upvotes: 0
Reputation: 5082
I totally agree with @HighTechRider: denormalization of the data ("exploding" it, as you put it) seems a must in this instance for performant queries, if the volume of data is as large as you imply. Otherwise, no matter what product you pick, queries will end up as a free-text scan of some fashion or other.
@chx's suggestion of Sphinx seems plausible, at least for accelerating the latter. But there are hidden costs to that route: you would need to bundle, install, manage, patch, update, etc. someone else's service alongside your software.
Minimizing desktop resource consumption during indexing and querying has to be a high priority, and setting up a free-text search server on a desktop seems somewhat contrary to that charter.
I'd start with the basic file system, using filesystem objects to represent your denormalized data. Or, if representing and executing your queries that way seems too complex, look at simple embedded table libraries like SQLite or SQL Server Compact Edition before trying to shoehorn more exotic server-targeted products onto the desktop.
Nice comparison of SQLite vs. SQL Compact Edition here:
SQLite can also create full-text indexes that would cover some of your "unknown field" scenarios in the future.
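To make the SQLite full-text route concrete, here is a minimal sketch using Python's built-in sqlite3 module and an FTS5 virtual table. The table and column names are invented for illustration, and it assumes your SQLite build includes the FTS5 extension (most modern builds do):

```python
import sqlite3

# Illustrative schema: one full-text indexed column per known email field.
con = sqlite3.connect(":memory:")
con.execute('CREATE VIRTUAL TABLE messages USING fts5("To", "From", "Bcc", "Subject")')
con.execute(
    'INSERT INTO messages ("To", "From", "Bcc", "Subject") VALUES (?, ?, ?, ?)',
    ("user@example.com,user@domain.com", "sender@more_address.com", "", "Email Subject"),
)

# The default tokenizer splits addresses at "@" and ".", so a quoted phrase
# query matches a domain wherever it appears in any indexed field.
rows = con.execute(
    "SELECT Subject FROM messages WHERE messages MATCH ?", ('"example.com"',)
).fetchall()
print(rows)
```

In practice you would loop the MATCH query over (or OR together) your 2K+ known domains, which the index makes far cheaper than scanning every message.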
As for map-reduce, its strategy is valid for the domain you're approaching.
Upvotes: 1
Reputation: 11760
Store the data in XML and search it with Sphinx. Use xmlpipe2 to feed Sphinx, through something like grep, so that only the known fields go into it. Once you need to search on more, add those fields to your filter and the schema, then reindex. Sphinx indexes at such speeds that this poses no real problem, and it can be distributed too.
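For reference, an xmlpipe2 stream for this kind of document looks roughly like the following; the field names are illustrative, and the real schema would list whichever known fields your filter lets through:

```xml
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
  <sphinx:schema>
    <sphinx:field name="to"/>
    <sphinx:field name="from"/>
    <sphinx:field name="bcc"/>
    <sphinx:field name="subject"/>
  </sphinx:schema>
  <sphinx:document id="1">
    <to>user@example.com,user@domain.com</to>
    <from>sender@more_address.com</from>
    <bcc></bcc>
    <subject>Email Subject</subject>
  </sphinx:document>
</sphinx:docset>
```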
You are calling for text search, and that means Solr or Sphinx; between the two, Sphinx is way easier to set up on a Windows desktop.
Upvotes: 0