jlp

Reputation: 1706

How to break a file into tokens based on regex using Java

I have a file in the following format. Records are separated by newlines, but some records contain line breaks themselves, as shown below. I need to read each record and process it separately. The file could be a few MB in size.

 <?aaaaa>
 <?bbbb
     bb>
 <?cccccc>

I have the code:

 FileInputStream fs = new FileInputStream(FILE_PATH_NAME);
 Scanner scanner = new Scanner(fs);
 scanner.useDelimiter(Pattern.compile("<\\?"));
 while (scanner.hasNext()) {
     String line = scanner.next();
     System.out.println(line);
 } 
 scanner.close();

But the output has the leading <? removed from each record:

aaaaa>
bbbb
   bb>
cccccc>

I know that Scanner consumes any input that matches the delimiter pattern. All I can think of is to add the delimiter back to each record manually.

Is there a way to keep the delimiter pattern from being removed?

Upvotes: 3

Views: 131

Answers (3)

Bohemian

Reputation: 425003

Break on a newline only when preceded by a ">" char:

scanner.useDelimiter("(?<=>)\\R"); // Note you can pass a string directly

\R matches any line break sequence, independent of the platform (available since Java 8)
(?<=>) is a lookbehind that asserts (without consuming) that the previous char is a >

Plus it's cool because <=> looks like Darth Vader's TIE fighter.
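To see the lookbehind delimiter in action, here is a minimal sketch that runs it over an in-memory string instead of a file. The class name `LookbehindDelimiterDemo`, the helper `records`, and the sample text are made up for illustration; only the delimiter pattern comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LookbehindDelimiterDemo {
    // Hypothetical helper: collects the tokens the Scanner produces when it
    // breaks only on a line break that directly follows a '>' character.
    static List<String> records(String text) {
        List<String> out = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            scanner.useDelimiter("(?<=>)\\R");
            while (scanner.hasNext()) {
                out.add(scanner.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample input modelled on the question; the second record spans two lines.
        String sample = "<?aaaaa>\n<?bbbb\n    bb>\n<?cccccc>\n";
        for (String record : records(sample)) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Each record keeps its leading `<?`, and the second record keeps its embedded line break, because the lookbehind is zero-width and only the line break after `>` is consumed.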

Upvotes: 5

ninja.coder

Reputation: 9648

Here is one way of doing it by using a StringBuilder:

public static void main(String[] args) throws FileNotFoundException {
    // try-with-resources closes the Scanner even if reading fails
    try (Scanner in = new Scanner(new File("C:\\test.txt"))) {
        StringBuilder builder = new StringBuilder();

        while (in.hasNextLine()) {
            // nextLine() strips the line terminator, so a multi-line record is joined
            String input = in.nextLine();
            for (int x = 0; x < input.length(); x++) {
                builder.append(input.charAt(x));
                // '>' ends a record: print it and start collecting the next one
                if (input.charAt(x) == '>') {
                    System.out.println(builder.toString());
                    builder.setLength(0);
                }
            }
        }
    }
}

Input:

 <?aaaaa>
 <?bbbb
     bb>
 <?cccccc>

Output:

 <?aaaaa>
 <?bbbb     bb>
 <?cccccc>
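Note that the second record comes out on one line because `nextLine()` drops the line terminator. If the embedded line break should survive, one possible variation is to read character by character from a `Reader` instead. The sketch below is not part of the original answer; the class name `CharRecordReaderDemo` and the helper `records` are hypothetical, and it assumes every record ends with `>`.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CharRecordReaderDemo {
    // Hypothetical helper: reads characters until '>' and keeps line breaks
    // that occur inside a record. Text after the last '>' is discarded.
    static List<String> records(Reader reader) {
        List<String> out = new ArrayList<>();
        StringBuilder builder = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                // Skip line breaks that sit between two records.
                if (builder.length() == 0 && (c == '\n' || c == '\r')) {
                    continue;
                }
                builder.append((char) c);
                if (c == '>') {
                    out.add(builder.toString());
                    builder.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "<?aaaaa>\n<?bbbb\n    bb>\n<?cccccc>\n";
        for (String record : records(new StringReader(sample))) {
            System.out.println(record);
        }
    }
}
```

For a real file, the `StringReader` could be swapped for a `BufferedReader` obtained from `Files.newBufferedReader`, so the input is not read one unbuffered character at a time.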

Upvotes: 0

KarelPeeters

Reputation: 1465

I'm assuming you want to ignore the newline character '\n' everywhere.

I would read the whole file into a String and then remove every '\n' from it. The relevant part of the code would look like this:

String fileString = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
fileString = fileString.replace("\n", "");
Scanner scanner = new Scanner(fileString);
...  //your code
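Once the whole file is in one String, one way to get the records with their `<?` prefix intact is to split on a zero-width lookahead instead of a consuming delimiter. This is only a sketch under assumptions: the class name `LookaheadSplitDemo` and the helper `records` are hypothetical, every record is assumed to start with `<?`, and removing '\r' as well is an addition here to cope with Windows line endings.

```java
import java.util.ArrayList;
import java.util.List;

class LookaheadSplitDemo {
    // Hypothetical helper: strips line breaks, then splits just before each "<?".
    static List<String> records(String fileString) {
        String flattened = fileString.replace("\r", "").replace("\n", "");
        List<String> out = new ArrayList<>();
        // (?=<\?) matches the empty position in front of "<?", so nothing is consumed.
        for (String part : flattened.split("(?=<\\?)")) {
            // Guard against empty or whitespace-only leading pieces.
            if (!part.trim().isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "<?aaaaa>\n<?bbbb\n    bb>\n<?cccccc>\n";
        for (String record : records(sample)) {
            System.out.println(record);
        }
    }
}
```

Because the lookahead has zero width, `split` cuts between records without eating the `<?` marker, so there is nothing to add back afterwards.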

Feel free to ask any further questions you might have!

Upvotes: 1
