Reputation: 4170
I have a large table, 1B+ records, that I need to pull down and run an algorithm on every record. How can I use ADO.NET to execute a "select * from table" asynchronously and start reading the rows one by one while ADO.NET is still receiving the data?
I also need to dispose of the records after I read them to save memory. So I am looking for a way to pull a table down record by record and basically shove each record into a queue for processing.
My data sources are Oracle and MSSQL. I have to do this for several data sources.
Upvotes: 0
Views: 1992
Reputation: 7316
Just use the DbDataReader (as Richard Nienaber said). It is a forward-only way of scrolling through the retrieved data, and you don't have to dispose of your data because a DbDataReader is forward-only.
When you use the DbDataReader, the records appear to be retrieved one by one from the database. It is, however, slightly more complicated:
Oracle (and probably MySQL) will fetch a few hundred rows at a time to decrease the number of round trips to the database. You can configure the fetch size of the DataReader. Most of the time it will not matter whether you fetch 100 rows or 1,000 rows per round trip; however, a very low value like 1 or 2 rows slows things down, because retrieving the data will then require too many round trips.
You probably don't have to set the fetch size manually; the default will be just fine.
edit1: See here for an Oracle example: http://www.oracle.com/technology/oramag/oracle/06-jul/o46odp.html
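For illustration, a minimal sketch of setting the fetch size explicitly with ODP.NET (the connection string, query, and the 1,000-row buffer are assumptions for this sketch; the linked article demonstrates the same technique):

using (var conn = new OracleConnection(connectionString))  // connectionString is a placeholder
using (var cmd = new OracleCommand("select * from big_table", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        // FetchSize is specified in bytes; RowSize * N buffers roughly
        // N rows per round trip to the server.
        reader.FetchSize = cmd.RowSize * 1000;
        while (reader.Read())
        {
            // process the current row
        }
    }
}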
Upvotes: 0
Reputation: 10564
A combination of a DataReader and an iterator block (a.k.a. generator) should be a good fit for this problem. The default DataReaders provided by Microsoft pull data one record at a time from a data source.
Here's an example in C#:
static IEnumerable<User> RetrieveUsers(DbDataReader reader)
{
    // Read() advances the reader one row at a time; NextResult() would
    // move to the next result set instead of the next row.
    while (reader.Read())
    {
        User user = new User
        {
            Name = reader.GetString(0),
            Surname = reader.GetString(1)
        };
        yield return user;
    }
}
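For completeness, the iterator might be consumed like this (the connection string, query, and Process method are placeholders, not part of the original answer):

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("select name, surname from users", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        foreach (User user in RetrieveUsers(reader))
        {
            Process(user);  // placeholder for the per-record algorithm
        }
    }
}

Because the rows are yielded lazily, only the current record is ever held in memory.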
Upvotes: 1
Reputation: 294407
You should use SSIS for this.
You need a bit of background on how the ADO.NET data providers work to understand what you can and cannot do. Let's take the SqlClient provider as an example. It is true that it is possible to execute queries asynchronously with BeginExecuteReader, but this asynchronous execution lasts only until the query starts returning results. At the wire level the SQL text is sent to the server, the server starts churning on the query execution, and eventually starts pushing result rows back to the client. As soon as the first packet comes back to the client, the asynchronous execution is done and the completion callback is executed. After that the client uses the SqlDataReader.Read() method to advance through the result set; there are no asynchronous methods on the SqlDataReader. This pattern works wonders for complex queries that return few results after some serious processing is done: while the server is busy producing the result, the client is idle with no threads blocked. However, things are completely different for simple queries that produce large result sets (as seems to be the case for you): the server will immediately produce results and will continue to push them back to the client. The asynchronous callback will fire almost instantaneously, and the bulk of the time will be spent by the client iterating over the SqlDataReader.
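A minimal sketch of the BeginExecuteReader pattern described above (the query is a placeholder; conn is assumed to be an open SqlConnection, and on .NET versions before 4.5 the connection string must also include "Asynchronous Processing=true"):

var cmd = new SqlCommand("select * from big_table", conn);
IAsyncResult ar = cmd.BeginExecuteReader();
// ... the client is free to do other work while the server churns ...
using (SqlDataReader reader = cmd.EndExecuteReader(ar))
{
    // From the first result packet onwards everything is synchronous:
    while (reader.Read())
    {
        // process the current row
    }
}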
You say you're thinking of placing the records into an in-memory queue first. What is the purpose of that queue? If your algorithm processes records more slowly than the DataReader iterates the result set, this queue will start to build up. It will consume live memory and eventually exhaust the memory on the client. To prevent this you would have to build in a flow-control mechanism, i.e. if the queue size is bigger than N, don't put any more records into it. But to achieve this you would have to suspend the data reader iteration, and if you do that you push flow control to the server, which will suspend the query until the communication pipe is available again (until you start reading from the reader again). Ultimately the flow control has to be propagated all the way to the server, which is always the case in any producer-consumer relation: the producer has to stop, otherwise intermediate queues fill up. Your in-memory queue serves no purpose at all, other than complicating things. You can simply process items from the reader one by one, and if your rate of processing is too slow, the data reader will cause flow control to be applied to the query running on the server. This happens automatically, simply because you don't call the DataReader.Read method; see the sketch below.
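A minimal sketch of that no-queue approach (Process is a hypothetical placeholder for the per-record algorithm):

using (var reader = cmd.ExecuteReader())
{
    while (reader.Read())
    {
        // If Process is slow, Read is simply called less often; the
        // network buffers fill up and the server is throttled automatically.
        Process(reader);
    }
}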
To summarise: for large result set processing you cannot do asynchronous processing, and there is no need for a queue.
Now the difficult part.
Is your processing doing any sort of update back in the database? If yes, then you have much bigger problems:
Upvotes: 8
Reputation: 5241
A good approach to this would be to pull back the data in blocks, iterate through each block adding the rows to your queue, then call again for the next block. This is going to be better than hitting the DB for each row. If you are pulling them back via a numeric PK this will be easy; if you need to order by something else, you can use ROW_NUMBER() to do it. A sketch of the numeric-PK approach is below.
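A minimal sketch of the numeric-PK variant against SQL Server (the table name, column names, block size, and the Enqueue helper are all assumptions for illustration; conn is an open SqlConnection):

const int BlockSize = 10000;
long lastId = 0;
while (true)
{
    using (var cmd = new SqlCommand(
        "select top (@n) id, payload from big_table " +
        "where id > @lastId order by id", conn))
    {
        cmd.Parameters.AddWithValue("@n", BlockSize);
        cmd.Parameters.AddWithValue("@lastId", lastId);
        int rows = 0;
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                lastId = reader.GetInt64(0);  // remember the last key seen
                Enqueue(reader);              // placeholder: push the record onto the queue
                rows++;
            }
        }
        if (rows < BlockSize) break;  // last block retrieved
    }
}

Seeking by "id > @lastId" instead of paging by offset keeps each block query cheap, since the server can start from the index position of the last key rather than recounting rows.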
Upvotes: 0