Reputation: 191
For the sake of this question, let's assume I have a String which contains the values Two;.Three;.Four (and so on), where the elements are separated by the two-character sequence ;. (a semicolon followed by a period).
Now I know there are multiple ways of splitting a string, such as split() and StringTokenizer (the latter being faster, and it works well), but my input file is around 1 GB and I am looking for something slightly more efficient than StringTokenizer.
After some research, I found that indexOf and substring are quite efficient, but the examples I found either use a single-character delimiter or return only a single word/element.
Sample code using indexOf and substring:
String s = "quick,brown,fox,jumps,over,the,lazy,dog";
int from = s.indexOf(',');
int to = s.indexOf(',', from+1);
String brown = s.substring(from+1, to);
The above works for printing brown, but how can I use indexOf and substring to split a line with multiple delimiters and display all the items as below?
Expected output
Two
Three
Four
....and so on
Upvotes: 11
Views: 14608
Reputation: 7519
This is the method I use for splitting large (1 GB+) tab-separated files. It is limited to a char delimiter to avoid any overhead of additional method invocations (which may be optimized out by the runtime anyway), but it can easily be converted to take a String delimiter. I'd be interested if anyone can come up with a faster method or improvements on this one.
public static String[] split(final String line, final char delimiter)
{
    // Worst case: every character is a delimiter, which yields length + 1 (empty) fields.
    String[] temp = new String[line.length() + 1];
    int wordCount = 0;
    int i = 0;
    int j = line.indexOf(delimiter, 0); // first substring
    while (j >= 0)
    {
        temp[wordCount++] = line.substring(i, j);
        i = j + 1;
        j = line.indexOf(delimiter, i); // rest of substrings
    }
    temp[wordCount++] = line.substring(i); // last substring
    // Trim the scratch array down to the number of fields actually found.
    String[] result = new String[wordCount];
    System.arraycopy(temp, 0, result, 0, wordCount);
    return result;
}
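For the ;. case in the question, the String-delimited conversion mentioned above might look something like the sketch below. This is only an illustration, not the method from this answer: the class name DelimiterSplitter and the choice to return a List instead of an array are my own assumptions. The one real difference from the char version is that the search position advances by delimiter.length() instead of 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name and the List return type are illustrative choices.
final class DelimiterSplitter {

    // Splits line on every occurrence of a (possibly multi-character) delimiter,
    // keeping empty fields, using only indexOf and substring.
    static List<String> split(String line, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> parts = new ArrayList<>();
        int start = 0;
        int end = line.indexOf(delimiter, start);
        while (end >= 0) {
            parts.add(line.substring(start, end));
            start = end + delimiter.length(); // skip past the whole delimiter
            end = line.indexOf(delimiter, start);
        }
        parts.add(line.substring(start)); // last field (may be empty)
        return parts;
    }

    public static void main(String[] args) {
        // Prints Two, Three and Four on separate lines.
        for (String item : split("Two;.Three;.Four", ";.")) {
            System.out.println(item);
        }
    }
}
```

Whether this beats StringTokenizer on a 1 GB input would need measuring on the actual data; the sketch only shows the indexing logic.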
Upvotes: 7
Reputation: 191
StringTokenizer is faster than StringBuilder.
import java.util.StringTokenizer;

public class StringTokenizerDemo {
    public static void main(String[] args) {
        String str = "This is String , split by StringTokenizer, created by me";

        // With no delimiter argument, tokens are separated by whitespace.
        StringTokenizer st = new StringTokenizer(str);
        System.out.println("---- Split by space ------");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken());
        }

        // With "," as the delimiter, tokens keep their surrounding spaces.
        System.out.println("---- Split by comma ',' ------");
        StringTokenizer st2 = new StringTokenizer(str, ",");
        while (st2.hasMoreTokens()) {
            System.out.println(st2.nextToken());
        }
    }
}
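One caveat when applying this to the ;. delimiter from the question: the second argument of StringTokenizer is a set of single delimiter characters, not a sequence, so ";." splits on every ; and on every period. The small sketch below illustrates that; the class name TokenizerDelimiterSetDemo and the helper tokens are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class showing that StringTokenizer delimiters are a character set.
final class TokenizerDelimiterSetDemo {

    // Collects all tokens of text, treating each char of delimiterChars as a delimiter.
    static List<String> tokens(String text, String delimiterChars) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiterChars);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        // Works as hoped when the fields contain neither ';' nor '.'.
        System.out.println(tokens("Two;.Three;.Four", ";.")); // [Two, Three, Four]
        // But a '.' inside a field is also treated as a delimiter.
        System.out.println(tokens("1.5;.2.5", ";."));         // [1, 5, 2, 5]
    }
}
```

StringTokenizer also never returns empty tokens, so consecutive delimiters collapse into one.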
Upvotes: 4
Reputation: 311054
If you want the ultimate in efficiency I wouldn't use Strings at all, let alone split them. I would do what compilers do: process the file a character at a time. Use a BufferedReader with a large buffer size, say 128 KB, and read a char at a time, accumulating them into, say, a StringBuilder until you get a ; or line terminator.
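A rough sketch of that idea, under some assumptions of mine: the class name CharStreamSplitter and the Consumer callback are invented for illustration, empty fields are skipped, and only a single ; (as described above) is treated as the delimiter, so the . of a ;. pair would need an extra check. Fields are handed to the callback one by one so the whole file never has to sit in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;

// Hypothetical helper: streams fields to a callback instead of building Strings per line.
final class CharStreamSplitter {

    // Reads source one char at a time and passes each non-empty field to sink.
    // A field ends at ';', '\n' or '\r'.
    static void forEachField(Reader source, Consumer<String> sink) throws IOException {
        StringBuilder current = new StringBuilder();
        // The BufferedReader size is given in chars; 128 * 1024 is the "large buffer" guess.
        try (BufferedReader reader = new BufferedReader(source, 128 * 1024)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == ';' || c == '\n' || c == '\r') {
                    if (current.length() > 0) {
                        sink.accept(current.toString());
                        current.setLength(0); // reuse the same builder
                    }
                } else {
                    current.append((char) c);
                }
            }
        }
        if (current.length() > 0) {
            sink.accept(current.toString()); // trailing field without a terminator
        }
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for a reader over the real input file.
        forEachField(new StringReader("Two;Three\nFour;Five"), System.out::println);
    }
}
```

For a real file the StringReader would be replaced by a reader over the file, with the charset chosen explicitly.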
Upvotes: 5