membersound

Reputation: 86925

How to improve performance of String.split?

I have large (100GB) CSV files that I periodically want to convert/reduce.

Each line is separated by a newline, where each line represents a record that I want to convert into a new record.

I therefore first have to split the line, extract the fields I want (and convert some of them with custom logic), and save the result as a reduced CSV.

Each line contains 2500-3000 characters in about 1000 fields separated by ^, but I need only approximately 120 of those fields.

I discovered that the actual line splitting takes most of the processing time. I'm thus trying to find the best split method for one line.

My benchmark on a sample size of 10GB is as follows:

approx 5mins:

static final Pattern pattern = Pattern.compile("\\^");
String[] split = pattern.split(line);

2min37s:

String[] split = line.split("\\^");

2min18s (approx 12% faster than string.split()):

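// StringTokenizer takes a set of literal delimiter characters, not a regex,
// so the caret must not be escaped; note that it also skips empty fields
StringTokenizer tokenizer = new StringTokenizer(line, "^");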
String[] split = new String[tokenizer.countTokens()];
int j = 0;
while (tokenizer.hasMoreTokens()) {
    split[j] = tokenizer.nextToken();
    j++;
}

Question: could there be more room for improvement?

Upvotes: 0

Views: 874

Answers (1)

Nowhere Man

Reputation: 19575

If the first 120 fields/columns in a line/row are the relevant ones, it may be better to use the limit overload String::split(String regex, int limit), which stops splitting once the requested number of fields has been produced, so only about 10% of the columns are actually split.

// keep relevant 120 fields + 1 field for the tail part
String[] properFields = string.split("\\^", 121); 
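To illustrate how the limit behaves, here is a minimal sketch on a short sample line. The class name LimitedSplitDemo and the helper firstFields are made up for this example; the point is only that the last array element receives the unsplit remainder of the line and can be dropped.

```java
import java.util.Arrays;

public class LimitedSplitDemo {
    // Hypothetical helper: returns the first `wanted` caret-separated fields of a line.
    static String[] firstFields(String line, int wanted) {
        // limit = wanted + 1: the extra slot receives the unsplit tail of the line
        String[] parts = line.split("\\^", wanted + 1);
        // Drop the tail if the line had more fields than requested
        return parts.length > wanted ? Arrays.copyOf(parts, wanted) : parts;
    }

    public static void main(String[] args) {
        String line = "aaa^bbb^ccc^11^22^33^44^55^xyz";
        System.out.println(Arrays.toString(line.split("\\^", 4)));
        // prints [aaa, bbb, ccc, 11^22^33^44^55^xyz]
        System.out.println(Arrays.toString(firstFields(line, 3)));
        // prints [aaa, bbb, ccc]
    }
}
```

This only helps when the wanted fields sit at the start of the line; fields further along still require scanning everything before them.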

Update

If there is a prefix of N fields to be skipped and the next M fields need to be processed, a regular expression may be used to define the parts of the line so that only the relevant part is picked up.

This may be implemented using a Stream<MatchResult> obtained via Scanner::findAll or Matcher::results (both available since Java 9):

String str = "aaa^bbb^ccc^11^22^33^44^55^xyz\r\npp^qqq^sss^99^88^77^66^55^$313";

// N = M = 3
Pattern line = Pattern.compile("(?<prefix>(([^^\\r\\n]+\\^){3}))(?<body>(([^^\\r\\n]+\\^?){0,3})).*\\R?");
Pattern field = Pattern.compile("\\^"); // caret-separated fields

System.out.println("matcher results");
line.matcher(str)
    .results()
    .map(mr -> mr.group(4))
    .flatMap(s -> Arrays.stream(s.split("\\^")))
    .forEach(System.out::println);

System.out.println("scanner findall");
Scanner scan = new Scanner(str);
scan.findAll(line)
    .map(mr -> mr.group(4))
    .flatMap(field::splitAsStream)
    .forEach(System.out::println);

Output:

matcher results
11
22
33
99
88
77
scanner findall
11
22
33
99
88
77

However, regular expressions may also hurt performance, so the simplest way to handle large strings may be to implement a custom method that returns the substring between the Nth and the (N + M)th occurrence of the delimiter:

public static String substring(char delim, int from, int to, String line) {
    int index = 0;
    int count = 0;
    int n = line.length();
    for (; index < n && count < from; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    int indexFrom = index;
    for (; index < n && count < to; index++) {
        if (line.charAt(index) == delim) {
            count++;
        }
    }
    return line.substring(indexFrom, index);
}

System.out.println("scanner plain");
scan = new Scanner(str);
while(scan.hasNextLine()) {
    System.out.println(substring('^', 3, 6, scan.nextLine()));
}

Output:

scanner plain
11^22^33^
99^88^77^
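If the needed fields are scattered across the line rather than contiguous, the same single-pass idea can be extended to pick individual fields by index. The following is only a sketch under assumptions not stated in the question: the class FieldPicker and the method pick are hypothetical names, the wanted indexes are zero-based and strictly ascending, and fields contain no quoted or escaped delimiters.

```java
import java.util.Arrays;

public class FieldPicker {
    /**
     * Hypothetical helper: extracts the fields at the given zero-based, strictly
     * ascending indexes in one pass, creating substrings only for wanted fields.
     * Indexes past the last field of the line are left as null.
     */
    static String[] pick(String line, char delim, int[] wanted) {
        String[] out = new String[wanted.length];
        int start = 0; // start offset of the current field
        int field = 0; // zero-based index of the current field
        int w = 0;     // next position in `wanted` to fill
        while (w < wanted.length && start <= line.length()) {
            int end = line.indexOf(delim, start);
            if (end < 0) {
                end = line.length(); // the last field has no trailing delimiter
            }
            if (field == wanted[w]) {
                out[w++] = line.substring(start, end);
            }
            start = end + 1;
            field++;
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "aaa^bbb^ccc^11^22^33^44^55^xyz";
        System.out.println(Arrays.toString(pick(line, '^', new int[] {0, 3, 8})));
        // prints [aaa, 11, xyz]
    }
}
```

Unlike StringTokenizer, this keeps empty fields, so field positions stay aligned, and the scan stops as soon as the last wanted field has been read. Whether it actually beats String.split on the real data would need to be measured.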

Upvotes: 1
