Robert Oschler

Reputation: 14375

Really fast Java function to split strings without affecting quoted strings?

I need a really fast string-splitting function that breaks apart a comma-delimited string without splitting double-quoted strings that contain commas. Is there a function that does this? If it's best handled by a regular expression, please give the pattern and, if applicable, any speed optimization tips I should know about, for example a way to invoke the regular expression so that the pattern isn't recompiled on every call. This function will be called thousands of times in a short period of time.

Note, I did see the regular expression posts on SO like this one:

Regular Expression To Split On Comma Except If Quoted

But they were for C# and other languages, not Java. Also, as I indicated above, if there is a non-regex method that is faster, I'd like to know about it.

-- roschler

Upvotes: 0

Views: 1068

Answers (4)

Matt

Reputation: 11805

There's also StrTokenizer in the commons-lang library:

StrTokenizer tokenizer = StrTokenizer.getCSVInstance();
tokenizer.reset(input);
String[] tokens = tokenizer.getTokenArray();

There's also a method to get the tokens as a list, and the class implements the Iterator/ListIterator interfaces, so you can use it in an iterator-style while loop.

You can also keep calling the "reset" method to clear the instance, and parse new input data.

One thing to note is that OpenCSV works with Reader instances and will parse across multiple lines. This class works with strings or char arrays and parses only a single record. It does have some memory overhead in that all the parsing is done up front when you ask for the first token.

It is, however, more configurable than OpenCSV.

DISCLOSURE: I contributed the original version of this class to the library.

Upvotes: 0

Ray Toal

Reputation: 88378

I think the most popular libraries for Java that do this naturally are supercsv and opencsv. Are you looking for a non-library solution?

Upvotes: 1

Mark Elliot

Reputation: 77024

You can basically rip off the C# code from the linked question, but you need to undo its iterator stuff, replacing yield return with, say, appending to a list:

public static List<String> SplitCSV(String csvString) {
    StringBuilder sb = new StringBuilder();
    boolean quoted = false;

    List<String> list = new ArrayList<String>();

    for(char c : csvString.toCharArray()) {
        if (quoted) {
            if (c == '"')
                quoted = false;
            else
                sb.append(c);
        } else {
            if (c == '"') {
                quoted = true;
            } else if (c == ',') {
                list.add(sb.toString());
                sb = new StringBuilder();
            } else {
                sb.append(c);
            }
        }
    }

    if (quoted)
        throw new IllegalArgumentException("csvString: Unterminated quotation mark.");

    list.add(sb.toString());
    return list;
}

Note that this, of course, won't deal with escaped quotes in quoted strings...
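
For what it's worth, here is a rough sketch of how the same loop could be extended to cope with escaped quotes. It assumes the common CSV convention that a doubled quote ("") inside a quoted field stands for one literal quote; if your data uses backslash escapes instead, this won't fit as-is. The class and method names (CsvSplitSketch, splitCsvWithEscapes) are made up for illustration. It also reuses a single StringBuilder via setLength(0) instead of allocating a new one per field, which may help a little when it's called thousands of times, though I haven't benchmarked it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
class CsvSplitSketch {

    // Splits one comma-delimited record. Commas inside double quotes are kept,
    // and "" inside a quoted field is assumed to mean one literal quote.
    static List<String> splitCsvWithEscapes(String line) {
        List<String> fields = new ArrayList<String>();
        StringBuilder sb = new StringBuilder();
        boolean quoted = false;
        int n = line.length();

        for (int i = 0; i < n; i++) {
            char c = line.charAt(i);
            if (quoted) {
                if (c == '"') {
                    if (i + 1 < n && line.charAt(i + 1) == '"') {
                        sb.append('"'); // doubled quote: emit one literal quote
                        i++;            // and skip its partner
                    } else {
                        quoted = false; // ordinary closing quote
                    }
                } else {
                    sb.append(c);
                }
            } else if (c == '"') {
                quoted = true;
            } else if (c == ',') {
                fields.add(sb.toString());
                sb.setLength(0);        // reuse the builder for the next field
            } else {
                sb.append(c);
            }
        }

        if (quoted)
            throw new IllegalArgumentException("line: Unterminated quotation mark.");

        fields.add(sb.toString());      // last field (may be empty)
        return fields;
    }
}
```

Calling CsvSplitSketch.splitCsvWithEscapes on the record a,"b,c","say ""hi""" should give three fields: a, then b,c, then say "hi".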

Upvotes: 2

Moe Matar

Reputation: 2054

It sounds like you are trying to parse CSV-formatted strings or files?

If so, maybe you don't have to write the code yourself. Check out the Apache Commons library for CSV parsing:

http://commons.apache.org/sandbox/csv/

Upvotes: 6
