Reputation: 86925
I have large (100 GB) CSV files that I periodically want to convert/reduce. Each line is terminated by a newline and represents a record that I want to convert into a new record. I therefore first have to split the line, extract my desired fields (and convert some of them with logic), and save the result again as a reduced CSV.
Each line contains 2500-3000 characters separated by ^, with about 1000 fields, but I need only approx. 120 fields from that split.
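The per-line processing is essentially the following (a sketch; the field indices are placeholders, not my real layout):

```java
import java.util.StringJoiner;

public class Reducer {
    // indices of the wanted fields -- placeholder values, not the real layout
    static final int[] WANTED = {0, 3, 7};

    // reduce one record line to the wanted fields
    static String reduce(String line) {
        String[] fields = line.split("\\^"); // <-- this split dominates the runtime
        StringJoiner sj = new StringJoiner("^");
        for (int i : WANTED) {
            sj.add(fields[i]); // plus some per-field conversion logic
        }
        return sj.toString();
    }
}
```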
I discovered that the actual line splitting takes most of the processing time. I'm thus trying to find the best split method for one line.
My benchmark on a sample size of 10GB is as follows:
approx 5 mins:
static final Pattern pattern = Pattern.compile("\\^");
String[] split = pattern.split(line);
2min37s:
String[] split = line.split("\\^");
2min18s (approx 12% faster than string.split()):
StringTokenizer tokenizer = new StringTokenizer(line, "^"); // delimiters are literal chars, not a regex
String[] split = new String[tokenizer.countTokens()];
int j = 0;
while (tokenizer.hasMoreTokens()) {
split[j] = tokenizer.nextToken();
j++;
}
Question: could there be more room for improvement?
Upvotes: 0
Views: 874
Reputation: 19575
If the relevant ~120 fields/columns come first in a line/row, it may be better to use the "limited" overload String::split(String regex, int limit), which splits only those ~10% of the columns and leaves the rest of the line untouched in the last array element.
// keep relevant 120 fields + 1 field for the tail part
String[] properFields = string.split("\\^", 121);
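A minimal demonstration of the limit semantics (field values here are made up): with limit = 3, two fields are split off and the remainder stays unsplit in the last element.

```java
import java.util.Arrays;

public class SplitLimitDemo {
    public static void main(String[] args) {
        String line = "a^b^c^d^e";
        // limit 3 => 2 fully split fields, the tail stays unsplit in the last element
        String[] parts = line.split("\\^", 3);
        System.out.println(Arrays.toString(parts)); // [a, b, c^d^e]
    }
}
```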
Update
If there is a prefix of N fields to be skipped and the next M fields need to be processed, a regular expression can be used to delimit the parts of the line so that only the important part is picked up. This can be implemented with a Stream<MatchResult>
retrieved via Scanner::findAll
or Matcher::results
String str = "aaa^bbb^ccc^11^22^33^44^55^xyz\r\npp^qqq^sss^99^88^77^66^55^$313";
// N = M = 3
Pattern line = Pattern.compile("(?<prefix>(([^^\\r\\n]+\\^){3}))(?<body>(([^^\\r\\n]+\\^?){0,3})).*\\R?");
Pattern field = Pattern.compile("\\^"); // caret-separated fields
System.out.println("matcher results");
line.matcher(str)
.results()
.map(mr -> mr.group("body"))
.flatMap(s -> Arrays.stream(s.split("\\^")))
.forEach(System.out::println);
System.out.println("scanner findall");
Scanner scan = new Scanner(str);
scan.findAll(line)
.map(mr -> mr.group("body"))
.flatMap(field::splitAsStream)
.forEach(System.out::println);
Output:
matcher results
11
22
33
99
88
77
scanner findall
11
22
33
99
88
77
However, regular expressions can themselves cost performance, so the simplest way to handle large strings may be to implement a custom method that returns the substring between the Nth and (N + M)th occurrences of the delimiter:
public static String substring(char delim, int from, int to, String line) {
    int index = 0;
    int count = 0;
    int n = line.length();
    // skip the first 'from' fields
    for (; index < n && count < from; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    int indexFrom = index;
    // advance past the end of field 'to'
    for (; index < n && count < to; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    return line.substring(indexFrom, index);
}
System.out.println("scanner plain");
scan = new Scanner(str);
while (scan.hasNextLine()) {
    System.out.println(substring('^', 3, 6, scan.nextLine()));
}
Output:
scanner plain
11^22^33^
99^88^77^
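As a variant of the same idea (a sketch, not benchmarked), String.indexOf can hop from delimiter to delimiter instead of testing every character; the contract matches the substring method above, including the trailing delimiter in the result:

```java
public class IndexOfSplit {
    // same contract as the charAt-loop version, but hops via indexOf
    public static String substring(char delim, int from, int to, String line) {
        int start = 0;
        for (int count = 0; count < from; count++) {
            int next = line.indexOf(delim, start);
            if (next < 0) return "";      // fewer than 'from' delimiters
            start = next + 1;
        }
        int end = start;
        for (int count = from; count < to; count++) {
            int next = line.indexOf(delim, end);
            if (next < 0) { end = line.length(); break; }
            end = next + 1;               // keep the trailing delimiter, like above
        }
        return line.substring(start, end);
    }
}
```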
Upvotes: 1