Reputation: 23850
I need to loop every row of a dataset 100k times.
This dataset contains 1 Primary key and another string column. Dataset has 600k rows.
So at the moment i am looping like this
for (int i = 0; i < dsProductNameInfo.Tables[0].Rows.Count; i++)
{
for (int k = 0; k < dsFull.Tables[0].Rows.Count; k++)
{
}
}
Now dsProductNameInfo has 100k rows and dsFull has 600k rows. Should i convert dsFull to a KeyValuePaired string list and loop that or there would not be any speed difference.
What solution would work fastest ?
Thank you.
C# 4.0 WPF application
Upvotes: 1
Views: 3780
Reputation: 21752
There might be quite a few things that could be optimized not related to the looping. E.g. reducing the number of iteration would yield a lot at pressent the body of the inner loop is executed 100k * 600k times so eliminating one iteration of the outer loop would eliminate 600k iterations of the inner (or you might be able to switch the inner and outer loop if it's easier to remove iterations from the inner loop)
One thing that you could do in any case is only index once for each table:
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows
var productInfoCount = productNameInfoRows.Count;
var fullRows = dsFull.Tables[0].Rows;
var fullCount = fullRows.Count;
for (int i = 0; i < productInfoCount; i++)
{
for (int k = 0; k < fullCount; k++)
{
}
}
inside the loops you'd get to the rows with productNameInfoRows[i]
and FullRows[k]
which is faster than using the long hand I'm guessing there might be more to gain from optimizing the body than the way you are looping over the collection. Unless of course you have already profiled the code and found the actual looping to be the bottle neck
EDIT After reading your comment to Marc about what you are trying to accomplish. Here's a go at how you could do this. It's worth noting that the below algorithm is probabalistic. That is there's a 1:2^32 for two words being seen as equal without actually being it. It is however a lot faster than comparing strings. The code assumes that the first column is the one you are comparing.
//store all the values that will not change through the execution for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;
var full = new List<int[]>(fullCount);
for (int i = 0; i < productInfoCount; i++){
//we're going to compare has codes and not strings
var prd = productNameInfoRows[i][0].ToString().Split(';')
.Select(s => s.GetHashCode()).OrderBy(t=>t).ToArray();
for (int k = 0; k < fullCount; k++){
//caches the calculation for all subsequent oterations of the outer loop
if (i == 0) {
full.Add(fullRows[k][0].ToString().Split(';')
.Select(s => s.GetHashCode()).OrderBy(t=>t).ToArray());
}
var fl = full[k];
var count = 0;
for(var j = 0;j<fl.Length;j++){
var f = fl[j];
//the values are sorted so we can exit early
for(var m = 0;m<prd.Length && prd[m] <= f;m++){
count += prd[m] == f ? 1 : 0;
}
}
if((double)(fl.Length + prd.Length)/count >= 0.6){
//there's a match
}
}
}
EDIT your comment motivated me to give it another try. The below code could have fewer iterations. Could have is because it depends on the number of matches and the number of unique words. A lot of unique words and a lot of matches for each (which would require a LOT of words per column) would potentially yield more iterations. However under the assumption that each row has few words this should yield substantial fewer iterations. your code has a NM complexity this has N+M+(matchesproductInfoMatches*fullMatches). In other words the latter would have to be almost 99999*600k for this to have more iterations than yours
//store all the values that will not change through the execution for faster access
var productNameInfoRows = dsProductNameInfo.Tables[0].Rows;
var fullRows = dsFull.Tables[0].Rows;
var productInfoCount = productNameInfoRows.Count;
var fullCount = fullRows.Count;
//Create a list of the words from the product info
var lists = new Dictionary<int, Tuple<List<int>, List<int>>>(productInfoCount*3);
for(var i = 0;i<productInfoCount;i++){
foreach (var token in productNameInfoRows[i][0].ToString().Split(';')
.Select(p => p.GetHashCode())){
if (!lists.ContainsKey(token)){
lists.Add(token, Tuple.Create(new List<int>(), new List<int>()));
}
lists[token].Item1.Add(i);
}
}
//Pair words from full with those from productinfo
for(var i = 0;i<fullCount;i++){
foreach (var token in fullRows[i][0].ToString().Split(';')
.Select(p => p.GetHashCode())){
if (lists.ContainsKey(token)){
lists[token].Item2.Add(i);
}
}
}
//Count all matches for each pair of rows
var counts = new Dictionary<int, Dictionary<int, int>>();
foreach(var key in lists.Keys){
foreach(var p in lists[key].Item1){
if(!counts.ContainsKey(p)){
counts.Add(p,new Dictionary<int, int>());
}
foreach(var f in lists[key].Item2){
var dic = counts[p];
if(!dic.ContainsKey(f)){
dic.Add(f,0);
}
dic[f]++;
}
}
}
Upvotes: 1
Reputation: 56
In your comment reply to my earlier answer you said that both datasets are essentially lists of strings and each string a delimited list of tags effectively. I would first look to normalise the csv strings in the database. I.e. Split the CSVs, add them to a tag table and link from the product to the tags via a link table.
You can then quite easily create a SQL statement that will do your matching according to the link records rather than by string (which be more performant in it's own right).
The issue you would then have is that if your sub-set product list needs to be passed into SQL from .net you would need to call the SP 100k times. Thankfully SQL 2008 R2, introduced TableTypes. You could define a table type in your database with one column to hold your product ID, have your SP accept that as an input parameter and then perform an inner join between your actual tables and your table parameter.. I've used this in my own project with very large datasets and the performance gain was massive.
On the .net side you can create a DataTable matching the structure of the SQL table type and then pass that as a command parameter when calling your SP (once!).
This article shows you how to do both the SQL and .net sides. http://www.mssqltips.com/sqlservertip/2112/table-value-parameters-in-sql-server-2008-and-net-c/
Upvotes: 1
Reputation: 56
What are you doing with the data inside the nested loop?
Is the source of your datasets a SQL database? If so, the best possible performance you could get would be to perform your calculation in SQL using an inner join and return the result to .net.
Another alternative would be to use the dataset's built in querying methods that act like SQL, but in-memory.
If neither of those options are appropriate, you would get a performance improvement by retrieving the 'full' dataset as a DataReader and looping over it as the outer loop. A dataset loads all of the data from SQL into memory in one hit. With 600k rows, this will take up a lot of memory! Whereas a DataReader will keep the connection to the DB open and stream rows as they are read. Once you have read a row the memory will be reused/reclaimed by the garbage collector.
Upvotes: 1
Reputation: 13549
In the exact scenario you mentioned, the performance would be the same except converting to the list would take some time and cause the list approach to be slower. You can easily find out by writing a unit test and timing it.
I would think it'd be best to do this:
// create a class for each type of object you're going to be dealing with
public class ProductNameInformation { ... }
public class Product { ... }
// load a list from a SqlDataReader (much faster than loading a DataSet)
List<Product> products = GetProductsUsingSqlDataReader(); // don't actually call it that :)
// The only thing I can think of where DataSets are better is indexing certain columns.
// So if you have indices, just emulate them with a hashtable:
Dictionary<string, Product> index1 = products.ToDictionary( ... );
Here are references to the SqlDataReader and ToDictionary concepts that you may or may not be familiar with.
The real question is, why isn't this kind of heavy processing done at the database layer? SQL servers are much more optimized for this type of work. Also, you may not have to actually do this, why don't you post the original problem and maybe we can help you optimize deeper?
HTH
Upvotes: 2
Reputation: 1063774
If performance is the critical factor, then I would suggest trying an array-of-struct; this has minimal indireaction (DataSet/DataTable has quite a lot of indirection). You mention KeyValuePair, and that would work, although it might not necessarily be my first choice. Milimetric is right to say that there is an overhead if you create a DataSet first and then build an array/list from tht - however, even then the time savings when looping may exceed the build time. If you can restructure the load to remove the DataSet completely, great.
I would also look carefully at the loops, to see if anything could reduce the actual work needed; for example, would building a dictionary/grouping allow faster lookups? Would sorting allow binary search? Can any operations be per-aggregated and applied at a higher level (with fewer rows)?
Upvotes: 1