Reputation: 493
I have a dataset of 26 million rows, but when I parse it with the uniVocity parser it reads only 18 million rows. The number of fields per row varies from 158 to 162, and the delimiter is the ASCII control character '\u0001'.
wc -l output on Linux:
wc -l withHeader.dat
26351323 withHeader.dat
But the parser reports: Total # of rows in file = 18554088 (this is the size of the list returned by parser.parseAll()).
Can someone explain what the issue could be?
These are my parser settings:
settings.getFormat().setLineSeparator("\n");
settings.selectFields("acctId","tcat", "transCode");
settings.getFormat().setDelimiter('\u0001');
//settings.setAutoConfigurationEnabled(true);
//settings.setMaxColumns(86);
settings.setHeaderExtractionEnabled(false);
// creates a CSV parser
CsvParser parser = new CsvParser(settings);
// parses all rows in one go.
List<String[]> allRows = parser.parseAll(newReader(filePath));
System.out.println("Total # of rows in file = " + allRows.size());
Upvotes: 1
Views: 306
Reputation: 6289
If your values can contain line separators, then the number of parsed records won't be equal to the number of lines.
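To illustrate that first point, here is a minimal sketch in plain Java with no parser library involved. The class and method names are made up for this example, and the quote handling is deliberately simplified (every quote character just toggles a "quoted" state), so treat it as an illustration of the idea rather than a description of how uniVocity works internally.

```java
public class RecordVsLineCount {

    // Counts physical lines the way `wc -l` does: one per '\n'.
    public static int countLines(String data) {
        int lines = 0;
        for (int i = 0; i < data.length(); i++) {
            if (data.charAt(i) == '\n') {
                lines++;
            }
        }
        return lines;
    }

    // Counts logical records: a '\n' ends a record only when it is
    // outside a quoted value. Simplification: each quote character
    // toggles the quoted state; escape characters are not handled.
    public static int countRecords(String data, char quote) {
        int records = 0;
        boolean inQuotes = false;
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;
            } else if (c == '\n' && !inQuotes) {
                records++;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        // Two records on three physical lines: the first record has a
        // quoted value that contains a line separator.
        String data = "1\u0001\"first\nsecond\"\u0001x\n"
                    + "2\u0001plain\u0001y\n";
        System.out.println("lines   = " + countLines(data));
        System.out.println("records = " + countRecords(data, '"'));
    }
}
```

Running it prints lines = 3 and records = 2. The same mechanism means a single unmatched quote character in the data can swallow many of the lines that follow it into one record.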
If that's not the case, then it's likely you are not configuring the format correctly. You might need to configure quotes, quote escapes, etc.
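A quick way to check whether quotes are involved is to look for lines that contain an odd number of quote characters. The sketch below is plain Java and assumes the quote character is `"` (the parser's default); the class name, the method name and the `limit` parameter are hypothetical, chosen just for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnbalancedQuoteFinder {

    // Returns the 1-based numbers of the lines that contain an odd
    // number of quote characters, stopping after `limit` hits.
    public static List<Integer> linesWithOddQuotes(Iterable<String> lines, char quote, int limit) {
        List<Integer> hits = new ArrayList<>();
        int lineNumber = 0;
        for (String line : lines) {
            if (hits.size() >= limit) {
                break;
            }
            lineNumber++;
            int quotes = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == quote) {
                    quotes++;
                }
            }
            if (quotes % 2 != 0) {
                hits.add(lineNumber);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Sample lines using the '\u0001' delimiter; line 2 has a stray quote.
        List<String> sample = Arrays.asList(
                "1\u0001ok",
                "2\u00015\" pipe",
                "3\u0001\"quoted\"");
        System.out.println(linesWithOddQuotes(sample, '"', 10));
    }
}
```

For the sample this prints [2]. On the real file you would feed it the lines read from a BufferedReader. If it reports hits, the values contain literal quote characters and the quote settings in the format need to account for that.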
My first suggestion is to try to detect the format automatically with:
settings.detectFormatAutomatically();
After parsing, check if you got the row count you expect to find. You can get what has been detected by calling:
CsvFormat detectedFormat = parser.getDetectedFormat();
Keep in mind that this process is not guaranteed to work, but in the majority of cases it does the trick. These features are available as of version 2.0.0.
If nothing helps, please attach (part of) your input file so I can take a look and update my answer.
Upvotes: 1