Ramie

Reputation: 1201

Fast way to Insert rows into DB?

I'm looping through the rows of a dataset and inserting them into an ActiveSpaces environment (by TIBCO, it's an in-memory DB). This is how I'm doing it.

Is there a faster way to go about this?

I was thinking of partitioning the rows and then processing each partition in parallel, but I have no clue whether that will make it faster.

System.Threading.Tasks.Parallel.ForEach(
    dataSet.Tables[0].Rows,
    currRow =>
    {
        var tuple = Com.Tibco.As.Space.Tuple.Create();

        // Copy each non-null column value into the tuple, typed by runtime check.
        for (int i = 0; i < currRow.Values.Length; i++)
        {
            if (currRow.Values[i] != null)
            {
                var k = ConvertToAny(currRow.Values[i].ToString());

                if (k.GetType().IsEquivalentTo(typeof(DateTime)))
                {
                    tuple.Put(dataSet.Tables[0].ColumnNames[i], (DateTime)k);
                }
                else if (k.GetType().IsEquivalentTo(typeof(double)))
                {
                    tuple.Put(dataSet.Tables[0].ColumnNames[i], (double)k);
                }
                else
                {
                    tuple.Put(dataSet.Tables[0].ColumnNames[i], k.ToString());
                }
            }
        }
        try
        {
            inSpace_.Put(tuple);
        }
        catch (Exception e)
        {
            // Swallowing the exception hides failed inserts; log it at minimum.
        }
    }
);

I'm thinking of batching the inserts, around 1,000 at a time, if someone can please help :(

EDIT:

List<Com.Tibco.As.Space.Tuple> tuplesToAdd = new List<Com.Tibco.As.Space.Tuple>();
for (int i = 0; i < dataSet.Tables[0].Rows.Length; i++)
{
    var tuple = Com.Tibco.As.Space.Tuple.Create();

    for (int j = 0; j < dataSet.Tables[0].Rows[i].Values.Length; j++)
    {
        if (dataSet.Tables[0].Rows[i].Values[j] != null)
        {
            var k = ConvertToAny(dataSet.Tables[0].Rows[i].Values[j].ToString());
            if (k is DateTime)
            {
                tuple.Put(dataSet.Tables[0].ColumnNames[j], (DateTime)k);
            }
            else if (k is Double)
            {
                tuple.Put(dataSet.Tables[0].ColumnNames[j], (Double)k);
            }
            else
            {
                tuple.Put(dataSet.Tables[0].ColumnNames[j], k.ToString());
            }
        }
    }

    tuplesToAdd.Add(tuple);

    if (i % 100000 == 0 || i == dataSet.Tables[0].Rows.Length - 1)
    {
        ThreadStart TUPLE_WORKER = delegate
        {
            inSpace_.PutAll(tuplesToAdd);
        };
        new Thread(TUPLE_WORKER).Start();
        // Note: clearing here races with the worker thread, which may still
        // be reading this same list inside PutAll.
        tuplesToAdd.Clear();
    }
}

That's my new way of trying to do it (by batching).

Upvotes: 1

Views: 503

Answers (1)

Jim Mischel

Reputation: 133995

I'm not certain, but it looks like you could avoid the ToString in your conversion code. That is, this:

var k = ConvertToAny(currRow.Values[i].ToString());

if (k.GetType().IsEquivalentTo(typeof(DateTime)))

can be replaced by:

var k = currRow.Values[i];
if (k is DateTime)
{
    tuple.Put(dataSet.Tables[0].ColumnNames[i], (DateTime)k);
}

That should save you a round trip from value to string and back.

Added in response to comments

First, your ConvertToAny is unnecessary. The item in currRow.Values[i] is already the right type; you just don't know which type it is. The exception is if the value could be the string representation of a DateTime or a Double. If the value is already a double, for example, then there's no reason to convert it to a string, parse the string, and convert it back. That is, the following two bits of code do the same thing:

object o = 3.14;
var k = ConvertToAny(o.ToString());
if (k.GetType().IsEquivalentTo(typeof(double)))

and

object o = 3.14;
if (o is double)

The only difference is that the second will be much faster.

However, if you have

object o = "3.14";

and you want that to be converted to double, then you'll have to do the conversion.
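As a rough sketch of that hybrid approach (assuming values are either already typed or strings that might hold a number or date; PutTypedValue is a hypothetical helper name, not part of the ActiveSpaces API):

// Sketch only: PutTypedValue is a hypothetical helper. It uses the value's
// runtime type when possible and falls back to parsing strings.
static void PutTypedValue(Com.Tibco.As.Space.Tuple tuple, string columnName, object value)
{
    if (value is DateTime)
    {
        tuple.Put(columnName, (DateTime)value);   // already typed: no conversion
    }
    else if (value is double)
    {
        tuple.Put(columnName, (double)value);     // already typed: no conversion
    }
    else
    {
        string s = value.ToString();
        double d;
        DateTime dt;
        if (double.TryParse(s, out d))
        {
            tuple.Put(columnName, d);             // string holding a number
        }
        else if (DateTime.TryParse(s, out dt))
        {
            tuple.Put(columnName, dt);            // string holding a date
        }
        else
        {
            tuple.Put(columnName, s);             // plain string
        }
    }
}

That keeps the fast path free of string conversions for values that already carry their runtime type.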

Your code that batches things has to lock the list when adding and updating. Otherwise you will corrupt it. I would suggest:

lock (tuplesToAdd)
{
    tuplesToAdd.Add(tuple);
    if ((tuplesToAdd.Count % 10000) == 0)
    {
        // push them all to the database.
        inSpace_.PutAll(tuplesToAdd);
        tuplesToAdd.Clear();
    }
}

And when you're all done (i.e. the Parallel.ForEach is done):

if (tuplesToAdd.Count > 0)
{
    // push the remaining items
    inSpace_.PutAll(tuplesToAdd);
}

Now, if you want to avoid blocking all of the threads during the update, you can get a little creative.

First, create two objects that you can lock on:

private object lockObject = new Object();
private object pushLock = new Object();

Create that right after you create the tuplesToAdd list. Then, when you want to add an item:

Monitor.Enter(lockObject); // acquires the lock
tuplesToAdd.Add(tuple);
if (tuplesToAdd.Count == 100000)
{
    var tuplesToPush = tuplesToAdd;
    tuplesToAdd = new List<Com.Tibco.As.Space.Tuple>(100000);
    Monitor.Exit(lockObject);  // releases the lock so other threads can process
    lock (pushLock)  // prevent multiple threads from pushing at the same time
    {
        inSpace_.PutAll(tuplesToPush);
    }
}
else
{
    Monitor.Exit(lockObject);
}

That way, while one thread is updating the database, the others can be filling the list for the next time around.
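For reference, here is the same pattern packaged as a helper, with try/finally so the lock is released even if Add throws (a sketch; AddTuple is a hypothetical name, the batch size is illustrative, and the list field is assumed to be initialized alongside the two lock objects):

// Sketch: assumes using System.Threading and System.Collections.Generic.
private object lockObject = new Object();
private object pushLock = new Object();
private List<Com.Tibco.As.Space.Tuple> tuplesToAdd = new List<Com.Tibco.As.Space.Tuple>(100000);

void AddTuple(Com.Tibco.As.Space.Tuple tuple)
{
    List<Com.Tibco.As.Space.Tuple> tuplesToPush = null;

    Monitor.Enter(lockObject);
    try
    {
        tuplesToAdd.Add(tuple);
        if (tuplesToAdd.Count == 100000)
        {
            // Swap in a fresh list so other threads can keep adding
            // while this thread pushes the full batch.
            tuplesToPush = tuplesToAdd;
            tuplesToAdd = new List<Com.Tibco.As.Space.Tuple>(100000);
        }
    }
    finally
    {
        Monitor.Exit(lockObject);  // released even if Add throws
    }

    if (tuplesToPush != null)
    {
        lock (pushLock)  // keeps two batches from being pushed at once
        {
            inSpace_.PutAll(tuplesToPush);
        }
    }
}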


And after thinking about it a bit more, you probably don't even need parallel processing for this task. It's likely that the vast majority of your time was spent by threads waiting on the Put call. A single thread that batches the tuples and writes them in bulk will probably execute much faster than your original solution. The parallel version you decided on will be faster, but I doubt that it will be hugely faster.
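A single-threaded batched version might look like this (a sketch; BuildTuple is a hypothetical helper standing in for the per-row conversion loop from the question, and 1000 is an illustrative batch size):

// Sketch: BuildTuple wraps the per-row conversion loop from the question.
var batch = new List<Com.Tibco.As.Space.Tuple>(1000);
foreach (var row in dataSet.Tables[0].Rows)
{
    batch.Add(BuildTuple(row));
    if (batch.Count == 1000)
    {
        inSpace_.PutAll(batch); // one bulk write per 1000 rows
        batch.Clear();          // safe here: no other thread touches the list
    }
}
if (batch.Count > 0)
{
    inSpace_.PutAll(batch);     // flush the final partial batch
}

No locks are needed because a single thread owns the list, so Clear is safe immediately after each PutAll returns.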

Upvotes: 1
