Reputation: 86925
I have large (100 GB) CSV files that I periodically want to convert/reduce. Each line is terminated by a newline and represents a record that I want to convert into a new record. I therefore first have to split the line, extract my desired fields (and convert some of them with logic), and save the result again as a reduced CSV.
Each line contains 2500-3000 characters separated by ^, with about 1000 fields, but I need only approx. 120 fields from that split.
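The per-line processing is essentially the following (a sketch; the field indices are placeholders, not my real layout):

```java
import java.util.StringJoiner;

public class Reducer {
    // indices of the wanted fields -- placeholder values, not the real layout
    static final int[] WANTED = {0, 3, 7};

    // reduce one record line to the wanted fields
    static String reduce(String line) {
        String[] fields = line.split("\\^"); // <-- this split dominates the runtime
        StringJoiner sj = new StringJoiner("^");
        for (int i : WANTED) {
            sj.add(fields[i]); // plus some per-field conversion logic
        }
        return sj.toString();
    }
}
```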
I discovered that the actual line splitting takes most of the processing time. I'm thus trying to find the best split method for one line.
My benchmark on a sample size of 10GB is as follows:
approx 5 mins:
static final Pattern pattern = Pattern.compile("\\^");
String[] split = pattern.split(line);
2min37s:
String[] split = line.split("\\^");
2min18s (approx 12% faster than string.split()):
StringTokenizer tokenizer = new StringTokenizer(line, "^"); // delimiters are literal chars, not a regex
String[] split = new String[tokenizer.countTokens()];
int j = 0;
while (tokenizer.hasMoreTokens()) {
split[j] = tokenizer.nextToken();
j++;
}
Question: could there be more room for improvement?
Upvotes: 0
Views: 874
Reputation: 19575
If the relevant ~120 fields/columns come first in a line/row, it may be better to use the "limited" overload String::split(String regex, int limit), which splits only those ~10% of the columns and leaves the rest of the line untouched in the last array element.
// keep relevant 120 fields + 1 field for the tail part
String[] properFields = string.split("\\^", 121);
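A minimal demonstration of the limit semantics (field values here are made up): with limit = 3, two fields are split off and the remainder stays unsplit in the last element.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    public static void main(String[] args) {
        String line = "a^b^c^d^e";
        // limit 3 => 2 fully split fields, the tail stays unsplit in the last element
        String[] parts = line.split("\\^", 3);
        System.out.println(Arrays.toString(parts)); // [a, b, c^d^e]
    }
}
```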
Update
If there is a prefix of N fields to be skipped and the next M fields need to be processed, a regular expression can be used to delimit the parts of the line so that only the important part is picked up. This can be implemented with a Stream<MatchResult>
retrieved via Scanner::findAll
or Matcher::results
String str = "aaa^bbb^ccc^11^22^33^44^55^xyz\r\npp^qqq^sss^99^88^77^66^55^$313";
// N = M = 3
Pattern line = Pattern.compile("(?<prefix>(([^^\\r\\n]+\\^){3}))(?<body>(([^^\\r\\n]+\\^?){0,3})).*\\R?");
Pattern field = Pattern.compile("\\^"); // caret-separated fields
System.out.println("matcher results");
line.matcher(str)
.results()
.map(mr -> mr.group("body"))
.flatMap(s -> Arrays.stream(s.split("\\^")))
.forEach(System.out::println);
System.out.println("scanner findall");
Scanner scan = new Scanner(str);
scan.findAll(line)
.map(mr -> mr.group("body"))
.flatMap(field::splitAsStream)
.forEach(System.out::println);
Output:
matcher results
11
22
33
99
88
77
scanner findall
11
22
33
99
88
77
However, regular expressions can themselves cost performance, so the simplest way to handle large strings may be to implement a custom method that returns the substring between the Nth and (N + M)th occurrences of the delimiter:
public static String substring(char delim, int from, int to, String line) {
    int index = 0;
    int count = 0;
    int n = line.length();
    // skip the first 'from' fields
    for (; index < n && count < from; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    int indexFrom = index;
    // advance past the end of field 'to'
    for (; index < n && count < to; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    return line.substring(indexFrom, index);
}
System.out.println("scanner plain");
scan = new Scanner(str);
while (scan.hasNextLine()) {
    System.out.println(substring('^', 3, 6, scan.nextLine()));
}
Output:
scanner plain
11^22^33^
99^88^77^
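As a variant of the same idea (a sketch, not benchmarked), String.indexOf can hop from delimiter to delimiter instead of testing every character; the contract matches the substring method above, including the trailing delimiter in the result:

```java
public class IndexOfSplit {
    // same contract as the charAt-loop version, but hops via indexOf
    public static String substring(char delim, int from, int to, String line) {
        int start = 0;
        for (int count = 0; count < from; count++) {
            int next = line.indexOf(delim, start);
            if (next < 0) return "";      // fewer than 'from' delimiters
            start = next + 1;
        }
        int end = start;
        for (int count = from; count < to; count++) {
            int next = line.indexOf(delim, end);
            if (next < 0) { end = line.length(); break; }
            end = next + 1;               // keep the trailing delimiter, like above
        }
        return line.substring(start, end);
    }
}
```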
Upvotes: 1