Andrei Vasilev

Reputation: 607

How to get the maximum length of each field in a csv file?

I need to find out the maximum length of each field in a CSV file.

For example, in the following CSV file:

1) var1's longest value is shj (3 characters),

2) var2's longest value is asdf (4 characters),

3) var3's longest value is asddfs (6 characters).

var1,var2,var3
a,asdf,df
aa,,
shj,,asddfs

So, the result I need is the array int[] maxLength = {3, 4, 6}.


So far I am using the CSVReader API. Here is my code:

CSVReader reader = new CSVReader(new FileReader(Looks.fileName));
        String[] row = reader.readNext();   // first row seeds the lengths
        int[] maxLength = new int[row.length];
        for (int i = 0; i < row.length; i++) {
            maxLength[i] = row[i].trim().length();
        }
        while ((row = reader.readNext()) != null) {
            for (int i = 0; i < row.length; i++) {
                maxLength[i] = Math.max(maxLength[i], row[i].trim().length());
            }
        }
        reader.close();

It works fine, but it is too slow for a huge file. I have around 100,000,000 rows.

Is there any efficient way to do this ? Can I use setAsciiStream somehow to estimate the length more efficiently ?

Upvotes: 0

Views: 2591

Answers (2)

slim

Reputation: 41263

Your code is about as efficient as it could be - it reads each byte once and once only, and it doesn't do any expensive seeking around the file.

It's possible that wrapping the FileReader in a BufferedReader could improve performance -- although it's not unlikely that CSVReader uses a BufferedReader internally.

There are physical limits on how quickly you can read 100,000,000 rows from disk. It's worth benchmarking the simplest program you can write which reads through a whole file, to see how long that takes, before deciding that your CSV reader is slow.

BufferedReader reader = new BufferedReader(new FileReader(filename));
char[] buffer = new char[1024 * 1024 * 10]; // 10MB; whatever
while (reader.read(buffer, 0, buffer.length) >= 0) {
     // nothing
}
reader.close();

Update: confirmed my suspicions, assuming you are using OpenCSV.

Here is the source for OpenCsv: http://sourceforge.net/p/opencsv/code/HEAD/tree/trunk/src/au/com/bytecode/opencsv/

The constructor for CSVReader wraps the Reader in a BufferedReader if it is not already a BufferedReader.

CSVReader.readNext() simply calls BufferedReader.readLine() repeatedly and does some pretty basic manipulation on the chars it acquires that way.

This is the fastest way of reading through a file: start at the beginning, read until you get to the end, using a buffer so that your underlying disk reads are the size the hardware and device drivers prefer.

Run the program above on a large file, and you'll find it takes about the same amount of time as your CSV parsing program, because even though mine doesn't do any appreciable processing, it has the same bottleneck as yours: the speed of reading from disk.

Indeed cat largefile >/dev/null (UNIX) or type largefile >NUL (Windows) will take a similar time.

Run your code with a profiler and you'll find that it's spending more of its time waiting on read() (in a native method that's part of core Java) than anywhere else.

You can't do anything to your Java program to speed this up. You might be able to speed it up by tweaking the hardware and/or operating system - things like tuning filesystem parameters and driver settings, putting the file on RAMdisk or SSD, and so on.

Upvotes: 1

Paul

Reputation: 3058

Is CSVReader buffered? If not, wrap your FileReader with a BufferedReader (and make it a nice large buffer size).
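A minimal sketch of that idea (assumptions: a 1 MB buffer, a plain String.split in place of CSVReader, which means quoted fields containing commas are not handled, and the first row is a header that should be skipped to match the {3, 4, 6} result in the question):

```java
import java.io.*;
import java.util.Arrays;

public class MaxLen {
    // Maximum trimmed length of each field. String.split is a stand-in
    // for a real CSV parser; it does not handle quoted fields with commas.
    static int[] maxFieldLengths(BufferedReader reader) throws IOException {
        reader.readLine(); // skip the header row (assumption: first row is a header)
        int[] max = null;
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
            if (max == null) max = new int[fields.length];
            for (int i = 0; i < fields.length && i < max.length; i++) {
                max[i] = Math.max(max[i], fields[i].trim().length());
            }
        }
        return max;
    }

    public static void main(String[] args) throws IOException {
        // 1 MB buffer; args[0] is the CSV file path
        try (BufferedReader r = new BufferedReader(new FileReader(args[0]), 1 << 20)) {
            System.out.println(Arrays.toString(maxFieldLengths(r)));
        }
    }
}
```

On the sample data from the question this prints [3, 4, 6].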

Upvotes: 2
