Joe Leffrey

Reputation: 183

Reading a CSV file with millions of rows via Java as fast as possible

I want to read a CSV file containing millions of rows and use the attributes for my decision tree algorithm. My code is below:

String csvFile = "myfile.csv";
List<String[]> rowList = new ArrayList<>();
String line = "";
String cvsSplitBy = ",";
String encoding = "UTF-8";
BufferedReader br2 = null;
try {
    int counterRow = 0;
    br2 =  new BufferedReader(new InputStreamReader(new FileInputStream(csvFile), encoding));
    while ((line = br2.readLine()) != null) { 
        line=line.replaceAll(",,", ",NA,");
        String[] object = line.split(cvsSplitBy);
        rowList.add(object); 
        counterRow++;
    }
    System.out.println("counterRow is: "+counterRow);
    for(int i=1;i<rowList.size();i++){
        try{
           //this method includes many if elses only.
           ImplementDecisionTreeRulesFor2012(rowList.get(i)[0],rowList.get(i)[1],rowList.get(i)[2],rowList.get(i)[3],rowList.get(i)[4],rowList.get(i)[5],rowList.get(i)[6]); 
        }
        catch(Exception ex){
           System.out.println("Exception occurred: " + ex);
        }
    }
}
catch(Exception ex){
    System.out.println("fix"+ex);
}

It works fine when the CSV file is not large. However, my file is indeed large, so I need another way to read it faster. Any advice is appreciated, thanks.

Upvotes: 18

Views: 32422

Answers (4)

ThomasRS

Reputation: 8287

If you're aiming for objects (i.e. data-binding), I've written a high-performance library, sesseltjonna-csv, that you might find interesting. A benchmark comparison with SimpleFlatMapper and uniVocity is here.

Upvotes: 0

Jeronimo Backes

Reputation: 6289

Just use uniVocity-parsers' CSV parser instead of trying to build a custom parser yourself. Your implementation will probably not be fast or flexible enough to handle all corner cases.

It is extremely memory efficient, and you can parse a million rows in less than a second. This link has a performance comparison of many Java CSV libraries, and univocity-parsers comes out on top.

Here's a simple example of how to use it:

CsvParserSettings settings = new CsvParserSettings(); // you'll find many options here, check the tutorial.
CsvParser parser = new CsvParser(settings);

// parses all rows in one go (you should probably use a RowProcessor or iterate row by row if there are many rows)
List<String[]> allRows = parser.parseAll(new File("/path/to/your.csv"));

BUT, that loads everything into memory. To stream all rows, you can do this:

String[] row;
parser.beginParsing(new File("/path/to/your.csv"));
while ((row = parser.parseNext()) != null) {
    //process row here.
}

A faster approach is to use a RowProcessor, which also gives you more flexibility:

settings.setRowProcessor(myChosenRowProcessor);
CsvParser parser = new CsvParser(settings);
parser.parse(new File("/path/to/your.csv"));

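For reference, here is a minimal sketch of what myChosenRowProcessor could look like (the class name below is made up; AbstractRowProcessor is the library's no-op base class you extend):

import com.univocity.parsers.common.ParsingContext;
import com.univocity.parsers.common.processor.AbstractRowProcessor;

public class MyRowProcessor extends AbstractRowProcessor {
    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // each parsed row is delivered here one at a time, so nothing has to
        // be accumulated in memory; apply your own logic to the cells, e.g.
        // ImplementDecisionTreeRulesFor2012(row[0], row[1], ..., row[6]);
    }
}
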
Lastly, it has built-in routines that use the parser to perform some common tasks (iterate java beans, dump ResultSets, etc)
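
As a rough sketch of the bean support (the Record class and its @Parsed mappings below are invented for illustration):

import com.univocity.parsers.annotations.Parsed;
import com.univocity.parsers.common.processor.BeanListProcessor;

public class Record {
    @Parsed(index = 0) public String attribute1; // hypothetical column mapping
    @Parsed(index = 1) public String attribute2; // hypothetical column mapping
    // ... one field per column you care about
}

BeanListProcessor<Record> beanProcessor = new BeanListProcessor<>(Record.class);
settings.setRowProcessor(beanProcessor);
new CsvParser(settings).parse(new File("/path/to/your.csv"));
List<Record> records = beanProcessor.getBeans(); // collects everything; for millions of rows prefer a streaming processor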

This should cover the basics; check the documentation to find the best approach for your case.

Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).

Upvotes: 14

user3996996

Reputation: 342

On top of the aforementioned uniVocity, it's worth checking a couple of other parsers as well; the three of them were, at the time of this comment, the fastest CSV parsers available.

Chances are that writing your own parser would be slower and buggier.

Upvotes: 1

laune

Reputation: 31290

In this snippet I see two issues which will slow you down considerably:

while ((line = br2.readLine()) != null) { 
    line=line.replaceAll(",,", ",NA,");
    String[] object = line.split(cvsSplitBy);
    rowList.add(object); 
    counterRow++;
}

First, rowList starts with the default capacity and will have to be enlarged many times, each time forcing a copy of the old underlying array into the new one.
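
If you do keep all rows in memory, at least pre-size the list so it is not repeatedly re-allocated (the 10_000_000 capacity here is just an assumed row count):

List<String[]> rowList = new ArrayList<>(10_000_000);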

Worse, however, is the excessive blow-up of the data into a String[] object. You need the columns/cells only when you call ImplementDecisionTreeRulesFor2012 for that row, not all the time while you read the file and process all the other rows. Move the split (or something better, as suggested by comments) into the second loop.

(Creating many objects is bad, even if you can afford the memory.)

Perhaps it would be better to call ImplementDecisionTreeRulesFor2012 while you read the "millions"? It would avoid the rowList ArrayList altogether.
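
A minimal sketch of that restructuring, keeping the question's split and replace for clarity (file name, encoding and the seven columns are taken from the question):

try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream("myfile.csv"), "UTF-8"))) {
    br.readLine(); // skip the header row (the original loop started at index 1)
    String line;
    while ((line = br.readLine()) != null) {
        // split only the row that is about to be processed; nothing is stored
        String[] cells = line.replaceAll(",,", ",NA,").split(",");
        ImplementDecisionTreeRulesFor2012(cells[0], cells[1], cells[2], cells[3],
                cells[4], cells[5], cells[6]);
    }
}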

Later: Postponing the split reduces the execution time for 10 million rows from 1m8.262s (when the program ran out of heap space) to 13.067s.

If you aren't forced to read all rows before you can call ImplementDecisionTreeRulesFor2012, the time drops to 4.902s.

Finally, writing the split and replace by hand:

String[] object = new String[7];
//...read...
    String x = line + ",";
    int iPos = 0;
    int iStr = 0; 
    int iNext = -1;
    while( (iNext = x.indexOf( ',', iPos )) != -1 && iStr < 7 ){
        if( iNext == iPos ){
            object[iStr++] = "NA";
        } else {
            object[iStr++] = x.substring( iPos, iNext );
        }
        iPos = iNext + 1;
    }
    // add more "NA" if rows can have less than 7 cells

reduces the time to 1.983s. This is about 30 times faster than the original code, which runs into an OutOfMemoryError anyway.
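
Put together, a rough sketch of how that hand-rolled split slots into the read loop (same assumptions as above: file name, encoding and seven columns from the question):

try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new FileInputStream("myfile.csv"), "UTF-8"))) {
    br.readLine(); // skip the header row
    String line;
    while ((line = br.readLine()) != null) {
        String[] object = new String[7];
        String x = line + ",";
        int iPos = 0, iStr = 0, iNext;
        while ((iNext = x.indexOf(',', iPos)) != -1 && iStr < 7) {
            object[iStr++] = (iNext == iPos) ? "NA" : x.substring(iPos, iNext);
            iPos = iNext + 1;
        }
        while (iStr < 7) object[iStr++] = "NA"; // pad rows with fewer than 7 cells
        ImplementDecisionTreeRulesFor2012(object[0], object[1], object[2],
                object[3], object[4], object[5], object[6]);
    }
}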

Upvotes: 11
