tequilaras

Reputation: 279

Java pattern with tab characters

I have a file with lines like:

string1 (tab) string2 (tab) string3 (tab) string4

I want to get string3 from every line. All I know about the lines is that string3 sits between the second and third tab characters. Is it possible to extract it with a pattern like:

Pattern pat = Pattern.compile(".\t.\t.\t.");

Upvotes: 1

Views: 10460

Answers (3)

ewan.chalmers

Reputation: 16235

If you want a regex which captures the third field only and nothing else, you could use the following:

String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
Pattern pattern = Pattern.compile(regex);

Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
  System.err.println(matcher.group(1));
}

I don't know whether this would perform any better than split("\\t") for parsing a large file.
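To see how the expression above behaves on well-formed and malformed lines, here is a minimal sketch. The class name ThirdFieldRegexDemo and the helper thirdField are made up for illustration; only the regex itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThirdFieldRegexDemo {
    // Same expression as in the answer: exactly four tab-separated fields,
    // with only the third one captured.
    private static final Pattern FOUR_FIELDS =
            Pattern.compile("(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)");

    /** Returns the third field, or null when the line does not have exactly four fields. */
    static String thirdField(String line) {
        Matcher m = FOUR_FIELDS.matcher(line);
        // matches() must consume the whole line, so extra or missing tabs fail.
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(thirdField("string1\tstring2\tstring3\tstring4"));
        System.out.println(thirdField("only\ttwo"));      // too few fields
        System.out.println(thirdField("a\tb\tc\td\te"));  // too many fields
    }
}
```

Because matches() anchors at both ends, a line with three or five fields is rejected rather than partially matched. An empty third field is still accepted and comes back as an empty string.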

UPDATE

I was curious to see how the simple split versus the more explicit regex would perform, so I tested three different parser implementations.

/** Simple split parser */
static class SplitParser implements Parser {
    public String parse(String line) {
        String[] fields = line.split("\\t");
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Split parser, but with compiled pattern */
static class CompiledSplitParser implements Parser {
    private static final String regex = "\\t";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        String[] fields = pattern.split(line);
        if (fields.length == 4) {
            return fields[2];
        }
        return null;
    }
}

/** Regex group parser */
static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    public String parse(String line) {
        Matcher m = pattern.matcher(line);
        if (m.matches()) {
            return m.group(1);
        }
        return null;
    }
}

I ran each ten times against the same million-line file. Here are the average results:

  • split: 2768.8 ms
  • compiled split: 1041.5 ms
  • group regex: 1015.5 ms

The clear conclusion is that it is important to compile your pattern, rather than rely on String.split, if you are going to use it repeatedly.

The result on compiled split versus group regex is not conclusive based on this testing. And probably the regex could be tweaked further for performance.
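The test harness itself is not shown, so the following is only a rough sketch of how such a comparison could be timed. The Parser interface shape, the Result holder, the synthetic in-memory data, and the class name ParserTimingSketch are all assumptions, not the code that produced the numbers above. It also does no JIT warm-up, so treat any figures it prints as indicative at best.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ParserTimingSketch {
    /** Assumed shape of the Parser interface; the original declaration is not shown. */
    interface Parser {
        String parse(String line);
    }

    /** Outcome of one timed pass over the input. */
    static final class Result {
        final int parsed;        // lines for which parse() returned non-null
        final long elapsedNanos; // wall-clock time for the pass
        Result(int parsed, long elapsedNanos) {
            this.parsed = parsed;
            this.elapsedNanos = elapsedNanos;
        }
    }

    /** Runs the parser once over every line and measures the elapsed time. */
    static Result time(Parser parser, List<String> lines) {
        long start = System.nanoTime();
        int parsed = 0;
        for (String line : lines) {
            if (parser.parse(line) != null) {
                parsed++;
            }
        }
        return new Result(parsed, System.nanoTime() - start);
    }

    /** Builds n synthetic four-field lines (hypothetical test data). */
    static List<String> sampleLines(int n) {
        List<String> lines = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            lines.add("a" + i + "\tb" + i + "\tc" + i + "\td" + i);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> lines = sampleLines(1_000_000);
        Pattern tab = Pattern.compile("\\t");

        Parser split = line -> {
            String[] f = line.split("\\t");
            return f.length == 4 ? f[2] : null;
        };
        Parser compiledSplit = line -> {
            String[] f = tab.split(line);
            return f.length == 4 ? f[2] : null;
        };

        for (int run = 0; run < 10; run++) {
            System.out.printf("split: %d ms, compiled split: %d ms%n",
                    time(split, lines).elapsedNanos / 1_000_000,
                    time(compiledSplit, lines).elapsedNanos / 1_000_000);
        }
    }
}
```

For anything beyond a quick comparison, a dedicated benchmarking tool that handles warm-up and dead-code elimination would give more trustworthy results.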

UPDATE

A further simple optimization is to re-use the Matcher rather than create one per loop iteration.

static class RegexParser implements Parser {
    private static final String regex = "(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)";
    private static final Pattern pattern = Pattern.compile(regex);

    // Matcher is not thread-safe...
    private Matcher matcher = pattern.matcher("");

    // ... so this method is no longer thread-safe
    public String parse(String line) {
        matcher = matcher.reset(line);
        if (matcher.matches()) {
            return matcher.group(1);
        }
        return null;
    }
}
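If the parser does need to be shared between threads, one possible way to keep the Matcher re-use is to hold one Matcher per thread. This is a sketch of that idea, not something I benchmarked; the class name ThreadLocalRegexParser is made up, and it relies on ThreadLocal.withInitial, which requires Java 8 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ThreadLocalRegexParser {
    private static final Pattern PATTERN =
            Pattern.compile("(?:[^\\t]*)\\t(?:[^\\t]*)\\t([^\\t]*)\\t(?:[^\\t]*)");

    // One Matcher per thread, created lazily the first time that thread calls parse().
    private static final ThreadLocal<Matcher> MATCHER =
            ThreadLocal.withInitial(() -> PATTERN.matcher(""));

    /** Returns the third field, or null if the line does not have exactly four fields. */
    public String parse(String line) {
        // reset(CharSequence) points this thread's Matcher at the new input.
        Matcher m = MATCHER.get().reset(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        ThreadLocalRegexParser parser = new ThreadLocalRegexParser();
        System.out.println(parser.parse("string1\tstring2\tstring3\tstring4"));
    }
}
```

Whether the ThreadLocal lookup costs less than creating a new Matcher per call is something you would have to measure for your own workload.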

Upvotes: 3

Jon Skeet

Reputation: 1501163

It sounds like you just want:

for (String line : lines) {
    String[] bits = line.split("\t");
    if (bits.length != 4) {
        // Handle appropriately, probably throwing an exception
        // or at least logging and then ignoring the line (using a continue
        // statement)
    }
    String third = bits[2];
    // Use...
}

(You can escape the string so that the regex engine has to parse the backslash-t as tab, but you don't have to. The above works fine.)
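A quick illustration of that point: "\t" hands the regex engine an actual tab character, while "\\t" hands it a backslash followed by t, which the regex syntax itself interprets as a tab. The class and method names here are made up for the demonstration.

```java
import java.util.Arrays;

final class TabEscapeDemo {
    /** The Java compiler turns "\t" into a real tab before the regex engine sees it. */
    static String[] splitWithLiteralTab(String line) {
        return line.split("\t");
    }

    /** The regex engine receives backslash-t and treats it as the tab escape. */
    static String[] splitWithRegexEscape(String line) {
        return line.split("\\t");
    }

    public static void main(String[] args) {
        String line = "string1\tstring2\tstring3\tstring4";
        // Both forms split on the same character, so the arrays are equal.
        System.out.println(Arrays.equals(splitWithLiteralTab(line), splitWithRegexEscape(line)));
    }
}
```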

Another alternative to the built-in String.split method using a regex is the Guava Splitter class. Probably not necessary here, but worth being aware of.

EDIT: As noted in comments, if you're going to repeatedly use the same pattern, it's more efficient to compile a single Pattern and use Pattern.split:

private static final Pattern TAB_SPLITTER = Pattern.compile("\t");

...

String[] bits = TAB_SPLITTER.split(line);

Upvotes: 5

erimerturk

Reputation: 4288

String string3 = tempValue.split("\\t")[2];
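One caveat with the one-liner: split with no limit argument drops trailing empty strings, so a line whose third and fourth fields are both empty yields an array of length 2 and indexing [2] throws ArrayIndexOutOfBoundsException. A defensive variant might look like the sketch below; the class and method names are hypothetical, and returning null for short lines is just one possible policy.

```java
final class TrailingFieldsDemo {
    /** Field count with the default limit: trailing empty fields are discarded. */
    static int fieldCountDefault(String line) {
        return line.split("\t").length;
    }

    /** Field count with a negative limit: trailing empty fields are kept. */
    static int fieldCountKeepTrailing(String line) {
        return line.split("\t", -1).length;
    }

    /** Returns the third field, or null when the line has fewer than three fields. */
    static String thirdField(String line) {
        String[] fields = line.split("\t", -1);
        return fields.length > 2 ? fields[2] : null;
    }

    public static void main(String[] args) {
        String line = "a\tb\t\t"; // third and fourth fields are empty
        System.out.println(fieldCountDefault(line));
        System.out.println(fieldCountKeepTrailing(line));
        System.out.println("[" + thirdField(line) + "]");
    }
}
```

The same behaviour affects any check of the form fields.length == 4 on the result of a plain split, since a line ending in an empty field will report fewer than four fields.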

Upvotes: 6
