Slow reading from CSV file

Question

I'm trying to read from a CSV file, but it's slow. Here's the code, roughly explained:

private static Film[] readMoviesFromCSV() {
    // Regex to split by comma without splitting in double quotes.
    // https://regexr.com/3s3me <- example on this data
    var pattern = Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
    Film[] films = null;
    try (var br = new BufferedReader(new FileReader(FILENAME))) {
        var start = System.currentTimeMillis();
        var temparr = br.lines().skip(1).collect(Collectors.toList());  // skip first line and read into List
        films = temparr.stream().parallel()
                .map(pattern::split)
                .filter(x -> x.length == 24 && x[7].equals("en")) // all 24 fields present and English-language movies only
                .filter(x -> (x[14].length() > 0)) // check if it has x[14] (date)
                .map(movieData -> new Film(movieData[8], movieData[9], movieData[14], movieData[22], movieData[23], movieData[7]))
                // movieData[8] = String title, movieData[9] = String overview
                // movieData[14] = String date (constructor parses it to LocalDate object)
                // movieData[22] = String avgRating
                .toArray(Film[]::new);
        System.out.println(MessageFormat.format("Execution time: {0}", (System.currentTimeMillis() - start)));
        System.out.println(films.length);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    return films;
}

The file is about 30 MB and reading it takes 3-4 seconds on average. I'm using streams, but it's still really slow. Is it because of the regex split on every line?
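
The split is a plausible suspect: the lookahead in that pattern scans to the end of the line at every comma, so the work per line grows with both the number of commas and the line length. For comparison, here is a minimal sketch of a single-pass split that tracks whether the current position is inside double quotes. The class name `QuoteAwareSplitter` is made up for illustration, and the sketch assumes that a record never spans multiple lines, which the regex version also assumes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original code or of any library.
final class QuoteAwareSplitter {

    private QuoteAwareSplitter() {
    }

    /**
     * Splits one CSV line on commas that are outside double quotes.
     * Quote characters are left in the returned fields, as with the regex.
     * Unlike Pattern.split, trailing empty fields are kept.
     */
    static String[] split(String line) {
        List<String> fields = new ArrayList<>();
        boolean inQuotes = false;
        int fieldStart = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
            } else if (c == ',' && !inQuotes) {
                fields.add(line.substring(fieldStart, i));
                fieldStart = i + 1;
            }
        }
        // The last field has no trailing comma.
        fields.add(line.substring(fieldStart));
        return fields.toArray(new String[0]);
    }
}
```

In the stream above, `.map(pattern::split)` could be swapped for `.map(QuoteAwareSplitter::split)` to measure whether the regex is the bottleneck. Because trailing empty fields are kept here, the `x.length == 24` check may match rows that the regex version dropped.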

EDIT: I've managed to make reading and processing about 3x faster with the univocity-parsers library. On average it now takes 950 ms to finish. That's pretty impressive.

private static Film[] readMoviesWithLib() {
    Film[] films = null;
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    RowListProcessor rowProcessor = new RowListProcessor();
    parserSettings.setProcessor(rowProcessor);
    parserSettings.setHeaderExtractionEnabled(true);
    CsvParser parser = new CsvParser(parserSettings);
    var start = System.currentTimeMillis();
    try {
        parser.parse(new BufferedReader(new FileReader(FILENAME)));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    List<String[]> rows = rowProcessor.getRows();
    films = rows.stream()
            .filter(Objects::nonNull)
            .filter(x -> x.length == 24 && x[14] != null && x[7] != null)
            .filter(x -> x[7].equals("en"))
            .map(movieData -> new Film(movieData[8], movieData[9], movieData[14], movieData[22], movieData[23], movieData[7]))
            .toArray(Film[]::new);
    System.out.println(MessageFormat.format("Time: {0}", (System.currentTimeMillis() - start)));
    return films;
}

Jeronimo Backes · Accepted Answer

Author of the univocity-parsers library here. You can speed up the code you posted in your edit a little bit further by rewriting it like this:

    //initialize an arraylist with a good size to avoid reallocation
    final ArrayList<Film> films = new ArrayList<>(20000);
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    parserSettings.setHeaderExtractionEnabled(true);

    //don't generate strings for columns you don't want
    parserSettings.selectIndexes(7, 8, 9, 14, 22, 23);

    //keep generating rows with the same number of columns found in the input
    //indexes not selected will have nulls as they are not processed.
    parserSettings.setColumnReorderingEnabled(false); 

    parserSettings.setProcessor(new AbstractRowProcessor(){
        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            if(row.length == 24 && "en".equals(row[7]) && row[14] != null){
                films.add(new Film(row[8], row[9], row[14], row[22], row[23], row[7]));
            }
        }
    });

    CsvParser parser = new CsvParser(parserSettings);
    long start = System.currentTimeMillis();
    try {
        parser.parse(new File(FILENAME), "UTF-8"); 
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }

    System.out.println(MessageFormat.format("Time: {0}", (System.currentTimeMillis() - start)));
    return films.toArray(new Film[0]);

For convenience, if you have to map rows into different classes, you can also use annotations in your Film class.

Hope this helps.
