Reputation: 471
I have this text (the numerical values may change):
.START_SEQUENCE RANDOM SENTENCE
3.40000
1 2 3 4 some text or not
4 3 8 9
.END_SEQUENCE
I want to get the following text (basically everything between .START_SEQUENCE and .END_SEQUENCE, but excluding both the rest of the .START_SEQUENCE line and the line that follows it):
1 2 3 4 some text or not
4 3 8 9
I have played with Pattern.DOTALL and Pattern.MULTILINE and managed to get rid of some things, but I never end up with exactly the selection I want. I have no clue how to move on.
Here is my last attempt:
final String START_SEQUENCE = "\\.START_SEQUENCE[^\n^\r]*";
final String END_SEQUENCE = "\\.END_SEQUENCE";
Pattern regex = Pattern.compile(START_SEQUENCE+"(.*)"+END_SEQUENCE, Pattern.DOTALL);
Matcher regexMatcher = regex.matcher(emn);
if (regexMatcher.find()) {
String ResultString = regexMatcher.group(1);
}
which results in:
3.40000
1 2 3 4 some text or not
4 3 8 9
Many thanks in advance!
Upvotes: 1
Views: 113
Reputation: 20163
A non-regex solution:
import java.util.ArrayList;
import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;
/**
   <P>{@code java BetweenLineMarkersButSkipFirstXmpl C:\java_code\\xbn\z\xmpl\text\regex\BetweenLineMarkersButSkipFirstXmpl_data.txt}</P>
 **/
public class BetweenLineMarkersButSkipFirstXmpl {
   public static final void main(String[] as_1RqdTxtFilePath) {
      LineIterator li = null;
      try {
         li = FileUtils.lineIterator(new File(as_1RqdTxtFilePath[0])); //Throws npx if null
      } catch(IOException iox) {
         throw new RuntimeException("Attempting to open \"" + as_1RqdTxtFilePath[0] + "\"", iox);
      } catch(RuntimeException rtx) {
         throw new RuntimeException("One required parameter: The path to the text file.", rtx);
      }
      String sLS = System.getProperty("line.separator", "\n");
      ArrayList<String> alsItems = new ArrayList<String>();
      boolean bStartMark = false;
      boolean bLine1Skipped = false;
      StringBuilder sdCurrentItem = new StringBuilder();
      while(li.hasNext()) {
         String sLine = li.next().trim();
         if(!bStartMark) {
            if(sLine.startsWith(".START_SEQUENCE")) {
               bStartMark = true;
               continue;
            }
            throw new IllegalStateException("Start mark not found.");
         }
         if(!bLine1Skipped) {
            bLine1Skipped = true;
            continue;
         } else if(!sLine.equals(".END_SEQUENCE")) {
            sdCurrentItem.append(sLine).append(sLS);
         } else {
            alsItems.add(sdCurrentItem.toString());
            sdCurrentItem.setLength(0);
            bStartMark = false;
            bLine1Skipped = false;
            continue;
         }
      }
      for(String s : alsItems) {
         System.out.println("----------");
         System.out.print(s);
      }
   }
}
Using this input:
.START_SEQUENCE RANDOM SENTENCE
3.40000
1 2 3 4
4 3 8 9
.END_SEQUENCE
.START_SEQUENCE RANDOM SENTENCE
3.40000
2 3 4 5
3 8 9 10
.END_SEQUENCE
Output:
[C:\java_code\]java BetweenLineMarkersButSkipFirstXmpl C:\java_code\BetweenLineMarkersButSkipFirstXmpl_data.txt
----------
1 2 3 4
4 3 8 9
----------
2 3 4 5
3 8 9 10
Upvotes: 1
Reputation: 56809
Use this regex with the Pattern.UNIX_LINES flag:
"\\.START_SEQUENCE.*\n.*\n((?:(?!\\.END_SEQUENCE).*\n)*+)\\.END_SEQUENCE"
Pattern.UNIX_LINES makes . equivalent to [^\n]. Normally, it is [^\n\r\u0085\u2028\u2029].
Let us break down the regex (to make it easier to read, escape sequences are resolved):
\.START_SEQUENCE.*\n               # Match the .START_SEQUENCE ... line
.*\n                               # Match (and ignore) the next line
((?:(?!\.END_SEQUENCE).*\n)*+)     # Capture the lines in between into group 1
\.END_SEQUENCE                     # Match the .END_SEQUENCE line
((?:(?!\.END_SEQUENCE).*\n)*+) matches the rest of the lines in between and puts the result into capturing group 1. Normally, ((?:.*\n)*?) would suffice, but to prevent a StackOverflowError on large inputs, I switched to the possessive quantifier *+. The check (?!\.END_SEQUENCE) is then needed so that the repetition stops at the end marker by itself, without relying on backtracking.
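To show how the pattern above might be wired into Java, here is a minimal sketch. The class name SequenceExtractor and the method extractAll are made up for this illustration, and the sketch assumes the input uses \n line endings (with \r\n input, UNIX_LINES lets . match the \r, so the captured lines would keep their \r).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class SequenceExtractor {
    // The regex from the answer. UNIX_LINES makes '.' stop only at '\n'.
    private static final Pattern SEQUENCE = Pattern.compile(
        "\\.START_SEQUENCE.*\n.*\n((?:(?!\\.END_SEQUENCE).*\n)*+)\\.END_SEQUENCE",
        Pattern.UNIX_LINES);

    // Returns the content of capturing group 1 for every sequence found in the text.
    static List<String> extractAll(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = SEQUENCE.matcher(text);
        while (m.find()) {
            blocks.add(m.group(1));
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Sample input taken from the question, assuming '\n' line endings.
        String emn = ".START_SEQUENCE RANDOM SENTENCE\n"
                   + "3.40000\n"
                   + "1 2 3 4 some text or not\n"
                   + "4 3 8 9\n"
                   + ".END_SEQUENCE\n";
        for (String block : extractAll(emn)) {
            System.out.print(block);
        }
    }
}
```

Each captured block keeps the trailing newline of its last data line, because the newline is consumed inside the group.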
Upvotes: 1
Reputation:
Not a lot to go on, but something like this should work; capture group 1 contains the data of interest.
(?-s)\.START_SEQUENCE.*\n.*\n([\S\s]*?)\.END_SEQUENCE
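For illustration, this regex could be used from Java roughly as follows. This is only a sketch: the class name LazySequenceMatch and the method firstBlock are invented here, the backslashes are doubled for the Java string literal, and \n line endings are assumed (without UNIX_LINES, . does not match \r, so \r\n input would not match as written).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class LazySequenceMatch {
    // (?-s) switches DOTALL off, so '.' stays on one line;
    // [\S\s]*? lazily spans lines until the first .END_SEQUENCE.
    private static final Pattern SEQUENCE = Pattern.compile(
        "(?-s)\\.START_SEQUENCE.*\\n.*\\n([\\S\\s]*?)\\.END_SEQUENCE");

    // Returns group 1 of the first match, or null if nothing matches.
    static String firstBlock(String text) {
        Matcher m = SEQUENCE.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Sample input taken from the question, assuming '\n' line endings.
        String emn = ".START_SEQUENCE RANDOM SENTENCE\n"
                   + "3.40000\n"
                   + "1 2 3 4 some text or not\n"
                   + "4 3 8 9\n"
                   + ".END_SEQUENCE\n";
        System.out.print(firstBlock(emn));
    }
}
```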
Upvotes: 1