Reputation: 123
I have a folder with 3000 CSV files ranging in size from 1 KB to 100 KB. Every row in these files is 43 characters long. Together they total 171 MB.
I am trying to write a program to parse these files as fast as I can.
I initially tried my own implementation, but was not happy with the results. I then found LumenWorks.Framework.IO.Csv on Stack Overflow. It makes bold claims:
To give more down-to-earth numbers, with a 45 MB CSV file containing 145 fields and 50,000 records, the reader was processing about 30 MB/sec. So all in all, it took 1.5 seconds! The machine specs were P4 3.0 GHz, 1024 MB.
I get nowhere near those results. My process takes well over 10 minutes. Is this because the input isn't one big stream but lots of small files, and there is per-file overhead? Is there anything else I could be doing?
The LumenWorks implementation didn't seem any faster than my own (I haven't benchmarked), even though it handles quotes, escaping, comments and multi-line fields, none of which I need. My format is very regular: comma-separated integers.
Cheers
Upvotes: 4
Views: 1642
Reputation: 3061
Have you tried using LogParser? I'm not sure that it will be any faster, but I have had success with it in some scenarios. It might be worth a quick test.
Where it might be faster is in reading lots of small CSVs like in your example. Regardless, you should benchmark your own code so that you can compare it to both LumenWorks and LogParser (and any other suggestions); assumptions are bad.
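Something along these lines would do for a rough comparison (untested sketch; the folder path is a placeholder and the inline parse loop is just a stand-in for whatever implementation you want to time):

```csharp
using System;
using System.Diagnostics;
using System.IO;

class ParserBenchmark
{
    // Times one pass over every CSV file in the folder with the given parser.
    static TimeSpan Time(string folder, Action<string> parseFile)
    {
        var files = Directory.GetFiles(folder, "*.csv");
        var sw = Stopwatch.StartNew();
        foreach (var file in files)
            parseFile(file);
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        const string folder = @"C:\data\csv";   // placeholder path

        // Baseline: just read the lines, no parsing at all.
        var readOnly = Time(folder, f => { foreach (var line in File.ReadLines(f)) { } });

        // Plug your own parser (or the LumenWorks-based one) in here.
        var parsed = Time(folder, f =>
        {
            foreach (var line in File.ReadLines(f))
            {
                var fields = line.Split(',');
                for (int i = 0; i < fields.Length; i++)
                    int.Parse(fields[i]);
            }
        });

        Console.WriteLine("Read only: {0}", readOnly);
        Console.WriteLine("Parsed:    {0}", parsed);
    }
}
```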
Upvotes: 0
Reputation: 942408
CSV file parsing is I/O bound, determined by how fast you can read the data off the disk. The fastest that can ever go is around 50 to 60 MB per second for a consumer-level hard drive. It sounds like LumenWorks is close to that limit.
You'll only ever get this kind of throughput, though, on a nice clean unfragmented disk with one large file, so that the disk read head is just pumping data without having to move much, only track-to-track seeks. Moving the head is the slow part, usually around 16 milliseconds on average.
There's a lot of head movement when you're reading 3000 files; just opening a file takes about 50 milliseconds. At the very least, do a comparable test to find the bottleneck: use a good text editor and copy/paste to make one giant file, and time reading that as well. Run a disk defragger first; Defraggler is a decent free one.
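A raw-read timing like this separates disk time from parse time (sketch only; the paths are placeholders, and run it on a cold file cache or the numbers will be flattered):

```csharp
using System;
using System.Diagnostics;
using System.IO;

class IoBottleneckTest
{
    // Reads every byte of each file without parsing anything, so the measured
    // time is (almost) pure I/O plus the per-file open overhead.
    static TimeSpan TimeRawRead(string[] files)
    {
        var buffer = new byte[64 * 1024];
        var sw = Stopwatch.StartNew();
        foreach (var path in files)
        {
            using (var stream = File.OpenRead(path))
            {
                while (stream.Read(buffer, 0, buffer.Length) > 0) { }
            }
        }
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        // Placeholder paths -- point these at your data and the concatenated copy.
        var smallFiles = Directory.GetFiles(@"C:\data\csv", "*.csv");
        var oneBigFile = new[] { @"C:\data\all.csv" };

        Console.WriteLine("3000 small files: {0}", TimeRawRead(smallFiles));
        Console.WriteLine("One big file:     {0}", TimeRawRead(oneBigFile));
    }
}
```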
As far as code improvements go, watch out for strings: they generate a lot of garbage and have poor CPU cache locality. Threads can't make I/O-bound code faster. The only possible improvement is one thread that reads the file and another that does the conversion, so that reading and converting overlap. Having more than one thread doing the reading is pointless; they'll just take turns waiting for the disk.
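For a format as regular as yours (comma-separated integers), you can skip Split/Substring entirely and accumulate the digits straight into an int, so no per-field string is ever created. A rough sketch of the idea, assuming non-negative values:

```csharp
using System.Collections.Generic;
using System.IO;

class NoGarbageParser
{
    // Parses comma-separated, non-negative integers out of one file without
    // creating a string per field: digits are accumulated straight into an int.
    public static List<int> ParseFile(string path)
    {
        var values = new List<int>();
        var buffer = new char[16 * 1024];

        using (var reader = new StreamReader(path))
        {
            int current = 0;
            bool inNumber = false;
            int count;
            while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                for (int i = 0; i < count; i++)
                {
                    char ch = buffer[i];
                    if (ch >= '0' && ch <= '9')
                    {
                        current = current * 10 + (ch - '0');
                        inNumber = true;
                    }
                    else   // comma, CR or LF ends the current value
                    {
                        if (inNumber) values.Add(current);
                        current = 0;
                        inNumber = false;
                    }
                }
            }
            if (inNumber) values.Add(current);   // file may not end with a newline
        }
        return values;
    }
}
```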
And watch out for the file system cache. The second time you run a test on the same file, you'll get the data from memory, not the disk. That's fast but won't tell you how it will perform in production.
Upvotes: 5
Reputation: 27377
Try reading the files on separate threads. If the data needs to be read synchronously, you could try creating threads to handle opening and closing the file handles and implement a queue so the actual parsing happens on a single thread.
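A rough sketch of that pattern using .NET 4's BlockingCollection (the folder path is a placeholder, and the Split-based parse loop is just a stand-in for your real conversion):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class OverlappedReader
{
    static void Main()
    {
        var folder = @"C:\data\csv";                       // placeholder path
        var queue = new BlockingCollection<string>(64);    // bounded so the reader can't race too far ahead

        // Producer: one thread does nothing but pull file contents off the disk.
        var reader = Task.Factory.StartNew(() =>
        {
            foreach (var path in Directory.GetFiles(folder, "*.csv"))
                queue.Add(File.ReadAllText(path));
            queue.CompleteAdding();
        });

        // Consumer: a single thread parses while the reader waits on the disk.
        foreach (var contents in queue.GetConsumingEnumerable())
        {
            foreach (var line in contents.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
            {
                var fields = line.Split(',');
                // ... convert fields to ints and use them here ...
            }
        }

        reader.Wait();
    }
}
```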
Upvotes: 0
Reputation: 752
Do all the files "appear" at once to be processed? If not, could you merge them incrementally into one file as they "arrive", and have your program process that single file? 10 minutes is a long time to process roughly 7 MB of data (the worst-case scenario from the numbers you quoted).
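If they do trickle in, a FileSystemWatcher could append each new file to one consolidated file as it arrives. A rough sketch (paths are placeholders; real code would need to retry while a file is still being written):

```csharp
using System;
using System.IO;

class IncrementalMerger
{
    static void Main()
    {
        const string incomingFolder = @"C:\data\incoming";   // placeholder paths
        const string mergedFile = @"C:\data\merged.csv";

        var watcher = new FileSystemWatcher(incomingFolder, "*.csv");
        watcher.Created += (sender, e) =>
        {
            // Append the new file's rows to the single merged file.
            File.AppendAllText(mergedFile, File.ReadAllText(e.FullPath));
        };
        watcher.EnableRaisingEvents = true;

        Console.WriteLine("Watching {0}; press Enter to stop.", incomingFolder);
        Console.ReadLine();
    }
}
```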
Upvotes: 0