Reputation: 1201
I'm looping through the rows of a dataset and inserting them into an ActiveSpaces environment (by TIBCO; it's an in-memory DB). This is how I'm doing it:
Is there a faster way to go about this?
I was thinking of partitioning the rows and then processing each partition in parallel, but I have no clue whether that will make it faster.
System.Threading.Tasks.Parallel.ForEach(
    dataSet.Tables[0].Rows,
    currRow =>
    {
        var tuple = Com.Tibco.As.Space.Tuple.Create();
        for (int i = 0; i < currRow.Values.Length; i++)
        {
            if (currRow.Values[i] != null)
            {
                var k = ConvertToAny(currRow.Values[i].ToString());
                if (k.GetType().IsEquivalentTo(typeof(DateTime)))
                {
                    tuple.Put(dataSet.Tables[0].ColumnNames[i], (DateTime)k);
                }
                else if (k.GetType().IsEquivalentTo(typeof(double)))
                {
                    tuple.Put(dataSet.Tables[0].ColumnNames[i], (double)k);
                }
                else
                {
                    tuple.Put(dataSet.Tables[0].ColumnNames[i], k.ToString());
                }
            }
        }
        try
        {
            inSpace_.Put(tuple);
        }
        catch (Exception e)
        {
        }
    }
);
I'm thinking of batching it at around 1000 at a time, if someone can please help :(
EDIT:
List<Com.Tibco.As.Space.Tuple> tuplesToAdd = new List<Com.Tibco.As.Space.Tuple>();
for (int i = 0; i < dataSet.Tables[0].Rows.Length; i++)
{
    var tuple = Com.Tibco.As.Space.Tuple.Create();
    for (int j = 0; j < dataSet.Tables[0].Rows[i].Values.Length; j++)
    {
        if (dataSet.Tables[0].Rows[i].Values[j] != null)
        {
            var k = ConvertToAny(dataSet.Tables[0].Rows[i].Values[j].ToString());
            if (k is DateTime)
            {
                tuple.Put(dataSet.Tables[0].ColumnNames[j], (DateTime)k);
            }
            else if (k is Double)
            {
                tuple.Put(dataSet.Tables[0].ColumnNames[j], (Double)k);
            }
            else
            {
                tuple.Put(dataSet.Tables[0].ColumnNames[j], k.ToString());
            }
        }
    }
    tuplesToAdd.Add(tuple);
    if (i % 100000 == 0 || i == dataSet.Tables[0].Rows.Length - 1)
    {
        ThreadStart TUPLE_WORKER = delegate
        {
            inSpace_.PutAll(tuplesToAdd);
        };
        new Thread(TUPLE_WORKER).Start();
        tuplesToAdd.Clear();
    }
}
That's my new way of trying to do it (by batching).
Upvotes: 1
Views: 503
Reputation: 133995
I'm not certain, but it looks like you could avoid the ToString in your conversion code. That is, this:
var k = ConvertToAny(currRow.Values[i].ToString());
if (k.GetType().IsEquivalentTo(typeof(DateTime)))
can be replaced by:
var k = currRow.Values[i];
if (k is DateTime)
{
    tuple.Put(dataSet.Tables[0].ColumnNames[i], (DateTime)k);
}
That should save you converting to string and then back.
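Put together, the whole loop might look something like this (just a sketch, assuming the values in currRow.Values are already typed objects rather than strings, and reusing the dataSet/inSpace_ names from your code):
var tuple = Com.Tibco.As.Space.Tuple.Create();
System.Threading.Tasks.Parallel.ForEach(
    dataSet.Tables[0].Rows,
    currRow =>
    {
        var tuple = Com.Tibco.As.Space.Tuple.Create();
        for (int i = 0; i < currRow.Values.Length; i++)
        {
            var k = currRow.Values[i];   // no ToString / re-parse round trip
            if (k == null)
                continue;

            if (k is DateTime)
                tuple.Put(dataSet.Tables[0].ColumnNames[i], (DateTime)k);
            else if (k is double)
                tuple.Put(dataSet.Tables[0].ColumnNames[i], (double)k);
            else
                tuple.Put(dataSet.Tables[0].ColumnNames[i], k.ToString());
        }
        inSpace_.Put(tuple);
    });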
Added in response to comments
First, your ConvertToAny is unnecessary. The item in currRow.Values[i] is already the right type. You just don't know what type it is. Unless you're saying that it could be the string representation of a DateTime or a Double. If the type is already a double, for example, then there's no reason to convert to string, parse, and then convert back. That is, the following two bits of code do the same thing:
object o = 3.14;
var k = ConvertToAny(o.ToString());
if (k.GetType().IsEquivalentTo(typeof(double)))
and
object o = 3.14;
if (o is double)
The only difference is that the second will be much faster.
However, if you have
object o = "3.14";
and you want that to be converted to double, then you'll have to do the conversion.
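For what it's worth, a ConvertToAny-style helper for that case is usually just a chain of TryParse calls. This is only a sketch; your actual implementation may well differ:
// Sketch of a ConvertToAny-style helper (hypothetical implementation).
// Tries double first, then DateTime, and falls back to the original string.
static object ConvertToAny(string s)
{
    double d;
    if (double.TryParse(s, out d))
        return d;

    DateTime dt;
    if (DateTime.TryParse(s, out dt))
        return dt;

    return s;
}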
Your code that batches things has to lock the list when adding and updating. Otherwise you will corrupt it. I would suggest:
lock (tuplesToAdd)
{
    tuplesToAdd.Add(tuple);
    if ((tuplesToAdd.Count % 10000) == 0)
    {
        // push them all to the database.
        inSpace_.PutAll(tuplesToAdd);
        tuplesToAdd.Clear();
    }
}
And when you're all done (i.e. the Parallel.ForEach is done):
if (tuplesToAdd.Count > 0)
{
    // push the remaining items
    inSpace_.PutAll(tuplesToAdd);
}
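Putting the batching and the final flush together, the overall shape would be roughly this (a sketch; the per-column Put logic is the same as in your snippets, so it's elided here):
var tuplesToAdd = new List<Com.Tibco.As.Space.Tuple>();

System.Threading.Tasks.Parallel.ForEach(
    dataSet.Tables[0].Rows,
    currRow =>
    {
        var tuple = Com.Tibco.As.Space.Tuple.Create();
        // ... fill the tuple from currRow, same as above ...

        lock (tuplesToAdd)
        {
            tuplesToAdd.Add(tuple);
            if ((tuplesToAdd.Count % 10000) == 0)
            {
                inSpace_.PutAll(tuplesToAdd);
                tuplesToAdd.Clear();
            }
        }
    });

// Parallel.ForEach has completed, so no other thread is touching the list.
if (tuplesToAdd.Count > 0)
{
    inSpace_.PutAll(tuplesToAdd);
    tuplesToAdd.Clear();
}
Building the tuple outside the lock keeps the critical section short, so the threads mostly contend only on the quick Add.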
Now, if you want to avoid blocking all of the threads during the update, you can get a little creative.
First, create two objects that you can lock on:
private object lockObject = new Object();
private object pushLock = new Object();
Create those right after you create the tuplesToAdd list. Then, when you want to add an item:
Monitor.Enter(lockObject); // acquires the lock
tuplesToAdd.Add(tuple);
if (tuplesToAdd.Count == 100000)
{
    var tuplesToPush = tuplesToAdd;
    tuplesToAdd = new List<Com.Tibco.As.Space.Tuple>(100000);
    Monitor.Exit(lockObject); // releases the lock so other threads can process
    lock (pushLock) // prevent multiple threads from pushing at the same time
    {
        inSpace_.PutAll(tuplesToPush);
    }
}
else
{
    Monitor.Exit(lockObject);
}
That way, while one thread is updating the database, the others can be filling the list for the next time around.
And after thinking about it a bit more, you probably don't even need to use parallel processing for this task. It's likely that the vast majority of your time was being spent by threads waiting on the Put call. Using a single thread to batch these and write them in bulk will probably execute much faster than your original solution. The parallel version you decided on will be faster, but I doubt that it will be hugely faster.
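For comparison, a single-threaded batched version is about this simple (again a sketch with the per-column Put logic elided; no locking is needed because only one thread ever touches the list):
// Sketch: single-threaded batched load, assuming PutAll is the expensive call.
var batch = new List<Com.Tibco.As.Space.Tuple>(10000);

foreach (var currRow in dataSet.Tables[0].Rows)
{
    var tuple = Com.Tibco.As.Space.Tuple.Create();
    // ... fill the tuple from currRow as shown earlier ...
    batch.Add(tuple);

    if (batch.Count == 10000)
    {
        inSpace_.PutAll(batch);
        batch.Clear();
    }
}

if (batch.Count > 0)
{
    inSpace_.PutAll(batch);   // flush the remainder
}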
Upvotes: 1