Reputation: 607
I need to find out the maximum length of each field in a CSV file.
For example, in the following CSV file:
1) var1's longest value is shj, which is 3 characters,
2) var2's longest value is asdf (4 characters),
3) var3's longest value is asddfs (6 characters).
var1,var2,var3
a,asdf,df
aa,,
shj,,asddfs
So the result I need is the array int[] maxLength = {3, 4, 6}.
So far I am using the CSVReader API. Here is my code:
CSVReader reader = new CSVReader(new FileReader(Looks.fileName));
String[] row = reader.readNext();                 // first row determines the number of columns
int[] maxLength = new int[row.length];
for (int i = 0; i < row.length; i++) {
    maxLength[i] = row[i].trim().length();
}
while ((row = reader.readNext()) != null) {
    for (int i = 0; i < row.length; i++) {
        maxLength[i] = Math.max(maxLength[i], row[i].trim().length());
    }
}
reader.close();
It works fine, but it is too slow for a huge file; I have around 100,000,000 rows.
Is there a more efficient way to do this? Can I use setAsciiStream somehow to estimate the lengths more efficiently?
Upvotes: 0
Views: 2591
Reputation: 41263
Your code is about as efficient as it could be - it reads each byte once and once only, and it doesn't do any expensive seeking around the file.
It's possible that wrapping the FileReader in a BufferedReader could improve performance, although it's likely that CSVReader already uses a BufferedReader internally.
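If you do want to try it, the wrap is a one-liner (a sketch; fileName here is a placeholder for your actual path):
// Hand CSVReader an already-buffered Reader; harmless even if CSVReader buffers internally.
CSVReader reader = new CSVReader(new BufferedReader(new FileReader(fileName)));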
There are physical limits on how quickly you can read 100,000,000 rows from disk. It's worth benchmarking the simplest program you can write which reads through a whole file, to see how long that takes, before deciding that your CSV reader is slow.
BufferedReader reader = new BufferedReader(new FileReader(filename));
char[] buffer = new char[1024 * 1024 * 10]; // 10MB; whatever
while (reader.read(buffer, 0, buffer.length) >= 0) {
    // do nothing; we only want to measure raw read time
}
reader.close();
Update: confirmed my suspicions, assuming you are using OpenCSV.
Here is the source for OpenCSV: http://sourceforge.net/p/opencsv/code/HEAD/tree/trunk/src/au/com/bytecode/opencsv/
The constructor for CSVReader wraps the Reader in a BufferedReader if it is not already a BufferedReader.
CSVReader.readNext() simply calls BufferedReader.readLine() repeatedly and does some pretty basic manipulation on the chars it acquires that way.
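Conceptually, readNext() amounts to something like the sketch below (this is just an illustration, not the actual OpenCSV source, and it ignores quoting and multi-line fields):
// Rough sketch only: read one buffered line and split it into fields.
String[] readNextSketch(BufferedReader br) throws IOException {
    String line = br.readLine();    // one readLine() call per CSV row
    if (line == null) {
        return null;                // end of file
    }
    return line.split(",", -1);     // naive split; the real parser handles quotes and escapes
}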
This is the fastest way of reading through a file: start at the beginning, read until you get to the end, using a buffer so that your underlying disk reads are the size the hardware and device drivers prefer.
Run the program above on a large file, and you'll find it takes about the same amount of time as your CSV parsing program, because even though mine doesn't do any appreciable processing, it has the same bottleneck as yours: the speed of reading from disk.
Indeed cat largefile >/dev/null (UNIX) or type largefile >NUL (Windows) will take a similar amount of time.
Run your code with a profiler and you'll find that it's spending more of its time waiting on read() (in a native method that's part of core Java) than anywhere else.
You can't do anything to your Java program to speed this up. You might be able to speed it up by tweaking the hardware and/or operating system - things like tuning filesystem parameters and driver settings, putting the file on RAMdisk or SSD, and so on.
Upvotes: 1
Reputation: 3058
Is CSVReader buffered? If not, wrap your FileReader in a BufferedReader (and give it a nice large buffer size).
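For example (a sketch; fileName is a placeholder and 8 MB is an arbitrary size to experiment with), BufferedReader's two-argument constructor lets you choose the buffer size:
// The second argument is the buffer size in chars; try a few values and measure.
Reader in = new BufferedReader(new FileReader(fileName), 8 * 1024 * 1024);
CSVReader reader = new CSVReader(in);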
Upvotes: 2