Neir0

Reputation: 13367

Database choice for fast updates

Let's say we have a lot of such objects (millions):

    class WordInfo
    {
        public string Value;
        public string SomeOtherFeatures;
        public List<Point> Points;
    }

And the following code:

    private Dictionary<string, WordInfo> _dict;

    public void ProcessData(IEnumerable<Tuple<string, int, int>> words)
    {
        foreach (var word in words)
        {
            if (_dict.ContainsKey(word.Item1))
            {
                // Existing word: append the new point.
                _dict[word.Item1].Points.Add(new Point(word.Item2, word.Item3));
            }
            else
            {
                // New word: create its entry.
                _dict.Add(word.Item1, new WordInfo(...));
            }
        }
    }


    static void Main()
    {
        while (true)
        {
            IEnumerable<Tuple<string, int, int>> data = GetDataSomewhere();
            ProcessData(data);
        }
    }

As you can see, this code must work 24/7. The main problem is that I don't know how to represent _dict (the place where I store the information) in a database. I need to process 1000-5000 words per second. A relational DB is not good for my task, right? What about NoSQL? I need fast UPDATE and INSERT operations. I also need a fast check for whether a word already exists (SELECT) in the DB. Because I have millions of records, that's also not trivial. What can you suggest? Maybe I should write a custom solution based on files?

Upvotes: 1

Views: 2420

Answers (1)

Branko Dimitrijevic

Reputation: 52107

A relational database should be able to insert/update 1000-5000 words per second easily, assuming you don't create too many transactions.

Transactions are ACID and "D" means durable: when the client receives a notification that the transaction is committed, it is guaranteed that the effects of the transaction are already in the permanent storage (so even if a power cut happens at that exact moment, the transaction won't be "erased"). In practice, this means the DBMS must wait for the disk to finish the physical write.

If you wrap each and every insert/update in its own transaction, you'll also have to perform this wait for each and every one of them. OTOH, if you wrap many inserts/updates in a single transaction, you'll have to pay this price only once per whole "chunk".
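
To illustrate, here is a sketch in SQL (using the WORD table from the model below; the values are made up), where the only difference is where the transaction boundaries fall:

    -- Autocommit: every statement is its own transaction,
    -- so every statement waits for a physical disk write.
    INSERT INTO WORD (WORD_VALUE, SOME_OTHER_FEATURES) VALUES ('cat', 'noun');
    INSERT INTO WORD (WORD_VALUE, SOME_OTHER_FEATURES) VALUES ('dog', 'noun');

    -- Chunked: the whole batch shares one transaction,
    -- so there is only one durability wait, at COMMIT.
    BEGIN TRANSACTION;
    INSERT INTO WORD (WORD_VALUE, SOME_OTHER_FEATURES) VALUES ('cat', 'noun');
    INSERT INTO WORD (WORD_VALUE, SOME_OTHER_FEATURES) VALUES ('dog', 'noun');
    -- ... thousands more inserts ...
    COMMIT;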


Also, checking for the existence of a specific row within millions of others is a task databases are very good at, thanks to the power of B-Tree indexes.
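
For example, assuming the WORD table from the model below, checking a word is a single seek through the primary key's B-Tree, no matter how many millions of rows the table holds:

    -- Returns a row if and only if the word already exists;
    -- the PK index turns this into an O(log n) seek.
    SELECT 1
    FROM WORD
    WHERE WORD_VALUE = 'example';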


As for the database structure, you'd need something similar to this:

[ER diagram: WORD (WORD_VALUE PK, SOME_OTHER_FEATURES) and POINT (WORD_VALUE FK, X, Y), where POINT's primary key starts with WORD_VALUE]
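
In DDL terms, that model looks roughly like this (the column types are assumptions; what matters is the natural keys and the identifying relationship):

    CREATE TABLE WORD (
        WORD_VALUE          VARCHAR(100) NOT NULL,  -- natural key
        SOME_OTHER_FEATURES VARCHAR(100),
        PRIMARY KEY (WORD_VALUE)
    );

    CREATE TABLE POINT (
        WORD_VALUE VARCHAR(100) NOT NULL,  -- identifying FK to WORD
        X          INT NOT NULL,
        Y          INT NOT NULL,
        PRIMARY KEY (WORD_VALUE, X, Y),
        FOREIGN KEY (WORD_VALUE) REFERENCES WORD (WORD_VALUE)
    );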

And you'd process it like this (pseudocode):

BEGIN TRANSACTION;

foreach(var word in words)
{
     try {
         INSERT INTO WORD (WORD_VALUE, SOME_OTHER_FEATURES) VALUES (word.Item1, ...);
     }
     catch (PK violation) {
         // Ignore it.
     }

     try {
         INSERT INTO POINT (WORD_VALUE, X, Y) VALUES (word.Item1, word.Item2, word.Item3);
     }
     catch (PK violation) {
         // Ignore it.
     }
} 

COMMIT;

(NOTE: I'm assuming you never update the SOME_OTHER_FEATURES after it has been initially inserted. If you do, the logic above would be more complicated.)
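
As a side note, many DBMSs can express the "insert and ignore duplicates" steps without exception handling. For example, in PostgreSQL syntax (other DBMSs have equivalents such as MERGE or INSERT IGNORE):

    -- PostgreSQL-specific sketch: skip the row when the PK already exists.
    INSERT INTO WORD (WORD_VALUE, SOME_OTHER_FEATURES)
    VALUES ('cat', 'noun')
    ON CONFLICT (WORD_VALUE) DO NOTHING;

    INSERT INTO POINT (WORD_VALUE, X, Y)
    VALUES ('cat', 10, 20)
    ON CONFLICT (WORD_VALUE, X, Y) DO NOTHING;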

If your DBMS supports it, consider making both of these tables clustered (a.k.a. index-organized). Also, if your DBMS supports it, compress the leading edge of POINT's primary index (WORD_VALUE), since all points related to the same word contain the same value there.
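
In Oracle syntax, for example (a sketch; the exact clauses differ per DBMS), that combination would look like:

    -- Oracle-specific sketch: POINT rows live inside the PK B-Tree itself
    -- (index-organized), and COMPRESS 1 stores each distinct WORD_VALUE
    -- prefix only once per index block instead of once per row.
    CREATE TABLE POINT (
        WORD_VALUE VARCHAR2(100) NOT NULL,
        X          NUMBER NOT NULL,
        Y          NUMBER NOT NULL,
        CONSTRAINT POINT_PK PRIMARY KEY (WORD_VALUE, X, Y)
    )
    ORGANIZATION INDEX
    COMPRESS 1;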


BTW, the model above uses so-called identifying relationships and natural keys. An alternate model that uses surrogate keys and non-identifying relationships is possible, but would complicate the kind of processing you need.
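
For comparison, that alternate model would look roughly like this (again only a sketch); note that inserting a POINT row would now require fetching the word's WORD_ID first, which is exactly the extra complication:

    -- Surrogate keys, non-identifying relationship (sketch).
    CREATE TABLE WORD (
        WORD_ID             INT NOT NULL,           -- surrogate key
        WORD_VALUE          VARCHAR(100) NOT NULL UNIQUE,
        SOME_OTHER_FEATURES VARCHAR(100),
        PRIMARY KEY (WORD_ID)
    );

    CREATE TABLE POINT (
        POINT_ID INT NOT NULL,                      -- surrogate key
        WORD_ID  INT NOT NULL,                      -- non-identifying FK
        X        INT NOT NULL,
        Y        INT NOT NULL,
        PRIMARY KEY (POINT_ID),
        FOREIGN KEY (WORD_ID) REFERENCES WORD (WORD_ID)
    );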

Upvotes: 2
