Kambiz

Reputation: 321

How to write a regular expression that extracts tab-terminated pieces of text?

I have been trying to create a program that replaces tab characters with spaces (assuming a tab is equivalent to 8 spaces, one or more of which may be taken by non-whitespace characters such as letters).

I start by reading the text of a file with a Scanner, as follows:

try {
    reader = new FileReader(file);
} catch (IOException io) {
    println("File not found");
}
Scanner scanner = new Scanner(reader);
scanner.useDelimiter("\\Z"); /* read the whole file as a single token */
String text = scanner.next();

Then I try to step through the pieces of text that end with a tab using ptrn1 below, and to extract the length of the last word of each piece using ptrn2:

Pattern ptrn1 = Pattern.compile(".*\\t", Pattern.DOTALL);
Matcher matcher1 = ptrn1.matcher(text);
String nextPiece = matcher1.group();
println(matcher1.group()); /* gives me the first substring ending with tab*/

However:

Pattern ptrn2 = Pattern.compile("\\s.*\\t"); /*supposed to capture the last word in the string*/
Matcher matcher2 = ptrn2.matcher(nextPiece);
String lastword = matcher2.group();

The last line gives me an error, since apparently nothing matches the pattern ("\\s.*\\t"). There is something wrong with this last regular expression, which is intended to say "any number of spaces, followed by any number of characters, followed by a tab". I have not been able to find out what is wrong with it, though. I have tried ("\\s*.+\\t"), ("\\s*.*\\t"), and ("\s+.+\\t"); still no luck.

Later on, per the recommendations below, I simplified the code and included the sample string in it, as follows:

import acm.program.*;
import acm.util.*;
import java.util.*;
import java.io.*;
import java.util.regex.*;

public class Untabify extends ConsoleProgram {
    public void run() {
        String s = "Be plain,\tgood son,\tand homely\tin thy drift.\tRiddling\tconfession\tfinds but riddling\tshrift. ";
        Pattern ptrn1 = Pattern.compile(".*?\t", Pattern.DOTALL);
        Pattern ptrn2 = Pattern.compile("[^\\s+]\t", Pattern.DOTALL);

        String nextPiece;

        Matcher matcher1 = ptrn1.matcher(s);

        while (matcher1.find()) {
            nextPiece = matcher1.group();
            println(nextPiece);
            Matcher matcher2 = ptrn2.matcher(nextPiece);
            println(matcher2.group());
        }
    }
}

The program crashes inconsistently: first at "println(matcher2.group());", and on the next run at "public void run()" with the message "Debug Current Instruction Pointer" (what does that mean?).

Upvotes: 0

Views: 182

Answers (3)

Rangi Keen

Reputation: 945

The pattern "\\s.*\\t" matches a single whitespace character (\s) followed by 0 or more characters (.*) followed by a single tab (\t). If you want to capture the last word and a trailing tab, you should use the word boundary escape \b:

Pattern.compile("\\b.*\\b\t");

You could replace the . above to use \w or whatever your definition of a word character is if you don't want to match any character.

Here's the code you'd use to match any word immediately before a tab:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegEx {
    public static void main(String args[]) {
        String text = "ab cd\t ef gh\t ij";
        Pattern pattern = Pattern.compile("\\b(\\w+)\\b\t", Pattern.DOTALL);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}

The above will output:

cd
gh

See the Regular Expression Tutorial, especially the sections on Predefined Character Classes and Boundary Matchers for more information.

You can get more detail and experiment with this regular expression on Regex101.
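
For the original goal of replacing tabs with spaces, a regular expression may not be needed at all. The following is only a minimal sketch, not the asker's program: the class name UntabifySketch and the method expandTabs are made up for illustration, and it assumes tab stops every 8 columns, so that each tab is padded out to the next multiple of 8.

```java
public class UntabifySketch {
    // Assumption taken from the question: a tab stop every 8 columns.
    static final int TAB_WIDTH = 8;

    // Replaces each tab with enough spaces to reach the next tab stop.
    public static String expandTabs(String text) {
        StringBuilder out = new StringBuilder();
        int column = 0; // current column within the current line
        for (char c : text.toCharArray()) {
            if (c == '\t') {
                int spaces = TAB_WIDTH - (column % TAB_WIDTH);
                for (int i = 0; i < spaces; i++) {
                    out.append(' ');
                }
                column += spaces;
            } else {
                out.append(c);
                // A newline starts a new line, so the column count restarts.
                column = (c == '\n') ? 0 : column + 1;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandTabs("Be plain,\tgood son,\tand homely"));
    }
}
```

Tracking the column directly avoids having to measure the last word before each tab, which is what the two-pattern approach in the question was trying to do.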

Upvotes: 1

The Guy with The Hat

Reputation: 11132

You do not need to double-escape the tab character (i.e. \\t); \t will do fine. \t is turned into a tab character by the Java compiler when it parses the string literal, and that tab character is passed to the regex parser, which matches it as a literal tab. You can see this answer for more information.
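
As a quick check of the point about \t versus \\t, here is a small sketch. The class name TabEscapeDemo and the helper finds are made up for this illustration; only Pattern and Matcher come from the standard library.

```java
import java.util.regex.Pattern;

public class TabEscapeDemo {
    // Returns true if the regex finds a match anywhere in the input.
    public static boolean finds(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "good son,\tand homely";
        // "\t": the Java compiler puts a real tab character into the pattern string.
        System.out.println(finds("\t", input));  // true
        // "\\t": the regex engine receives backslash + t and treats it as a tab.
        System.out.println(finds("\\t", input)); // true
    }
}
```

Both spellings match the tab in the input, so the choice between them is a matter of style for this particular escape.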

Also, you should use Pattern.DOTALL, not Pattern.Dotall.

Upvotes: 1

acarlon

Reputation: 17272

It would be useful to see a sample string. If you just want the last word before the tab, then you can use this:

([^\s]+)\t

Note the () are to put the last word in a group. [^\s]+ means 1 or more non-space.
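
A minimal sketch of how this pattern could be used from Java follows. The class name LastWordBeforeTab and the method wordsBeforeTabs are made up for illustration. Note that find() has to be called, and has to return true, before group() is used; calling group() on a fresh Matcher throws an IllegalStateException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastWordBeforeTab {
    // Collects every run of non-whitespace characters directly followed by a tab.
    public static List<String> wordsBeforeTabs(String text) {
        Pattern pattern = Pattern.compile("([^\\s]+)\t");
        Matcher matcher = pattern.matcher(text);
        List<String> words = new ArrayList<>();
        // group(1) is only valid after a successful find().
        while (matcher.find()) {
            words.add(matcher.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "Be plain,\tgood son,\tand homely\tin thy drift.";
        for (String word : wordsBeforeTabs(s)) {
            System.out.println(word + " (length " + word.length() + ")");
        }
    }
}
```

Since [^\s] excludes only whitespace, punctuation attached to a word (such as a trailing comma) is included in the captured group.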

Upvotes: 1
