Reputation: 6874
I have a .rtf file that has lots of bold titles in it. I am trying to capture the data between two bold titles. However, the tags that mark text as bold are exactly the same at either end of the text.
So I am trying to find a pattern that will capture everything up to the bold tag on the next nearest line, rather than stopping at a tag on the same line. I am using Java.
Example text:
\par }{\b\f1\fs24\ul\insrsid14762702 Data}{\insrsid14762702 \tab \tab }{\b\f1\fs24
\par Start:\tab 2015-01-14 10:56:25
\par Duration:\tab 22:40:23
\par Positions:\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm
\par Sensor Position(s):\tab -10.0, 5.0 cm
\par Depth:\tab N/A
\par
\par }{\b\f1\fs24
\par }{\b\f1\fs24\ul\insrsid14762702 History}{\insrsid14762702
\par Other
{\b\f1\fs24\ul\insrsid14762702
What I am currently using:
((\\\\b\\\\f1\\\\fs24.+?\\{\\\\b\\\\f1\\\\fs24))
The whole Java line is:
Pattern pattern = Pattern.compile("((\\\\b\\\\f1\\\\fs24.+?\\{\\\\b\\\\f1\\\\fs24))",Pattern.DOTALL);
Which is giving me:
\par }{\b\f1\fs24\ul\insrsid14762702 Data}{\insrsid14762702 \tab \tab }{\b\f1\fs24
\par }{\b\f1\fs24
\par }{\b\f1\fs24
{\b\f1\fs24\ul\insrsid14762702 History}{\insrsid14762702
\par Other
{\b\f1\fs24
The expected output is:
\par }{\b\f1\fs24\ul\insrsid14762702 Data}{\insrsid14762702 \tab \tab }{\b\f1\fs24
\par Start:\tab 2015-01-14 10:56:25
\par Duration:\tab 22:40:23
\par Positions:\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm
\par Sensor Position(s):\tab -10.0, 5.0 cm
\par Depth:\tab N/A
\par
\par }{\b\f1\fs24
And:
\par }{\b\f1\fs24
\par }{\b\f1\fs24
And:
\par }{\b\f1\fs24\ul\insrsid14762702 History}{\insrsid14762702
\par Other
{\b\f1\fs24\ul\insrsid14762702
Upvotes: 1
Views: 114
Reputation: 1065
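You can capture the text between two \b tags with a named group, as below. Note that the (?m) flag only changes how ^ and $ behave; the dot still does not match a line break unless (?s) (DOTALL) is also set, so this pattern only matches when both tags are on the same line:
import java.util.regex.Matcher;
import java.util.regex.Pattern;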
String text = "\\par }{\\b\\f1\\fs24\\ul\\insrsid14762702 Data}{\\insrsid14762702 \\tab \\tab }{\\b\\f1\\fs24\n" +
"\\par Start:\\tab 2015-01-14 10:56:25\n" +
"\\par Duration:\\tab 22:40:23\n" +
"\\par Positions:\\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm\n" +
"\\par Sensor Position(s):\\tab -10.0, 5.0 cm\n" +
"\\par Depth:\\tab N/A\n" +
"\\par \n" +
"\\par }{\\b\\f1\\fs24\n" +
"\\par }{\\b\\f1\\fs24\\ul\\insrsid14762702 History}{\\insrsid14762702 \n" +
"\\par Other \n" +
"{\\b\\f1\\fs24\\ul\\insrsid14762702";
Pattern pattern = Pattern.compile("(?mi)\\\\b(?<content>.*)\\\\b");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    String content = matcher.group("content");
    System.out.println("content: " + content);
}
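The difference between the multiline and dot-all flags matters for this question, so here is a minimal sketch of it. The class name DotAllVsMultilineDemo, the helper firstMatch, and the shortened three-line input are assumptions made for illustration; they are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllVsMultilineDemo {
    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the RTF text: two \b tags on different lines.
        String input = "{\\b Title\nbody line\n{\\b Next";
        // Regex: a literal \b, anything, then another literal \b.
        String regex = "\\\\b.*\\\\b";

        // MULTILINE only affects ^ and $; '.' still stops at '\n', so there is no match.
        System.out.println(firstMatch(regex, Pattern.MULTILINE, input)); // prints null

        // DOTALL lets '.' match '\n', so the match runs from the first \b to the last \b.
        System.out.println(firstMatch(regex, Pattern.DOTALL, input));
    }
}
```

With DOTALL the greedy .* runs to the last tag in the whole input, which is why the question's pattern uses a lazy quantifier on top of Pattern.DOTALL.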
Upvotes: 1
Reputation: 785058
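You can use two capturing groups for this. The first captures the line containing the starting tag plus the text up to the ending tag (which must not be on the same line). Because each ending tag is also the starting tag of the next block, the matches overlap, so the ending tag has to be matched inside a lookahead, which does not consume it. The second capturing group sits inside that lookahead.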
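The overlapping-match idea is easier to see on a toy input than on the RTF text. The following is only a sketch: the class name LookaheadOverlapDemo, the helper collect, and the input "A1A2A3A" (where A plays the role of the bold tag) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadOverlapDemo {
    // Hypothetical helper: for each match, joins group 1 with group 2 (if the regex has one).
    static List<String> collect(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            String tail = m.groupCount() >= 2 ? m.group(2) : "";
            out.add(m.group(1) + tail);
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "A1A2A3A";

        // The closing A is consumed, so it cannot also open the next match.
        System.out.println(collect("(A\\dA)", input));       // prints [A1A, A3A]

        // The closing A is only looked at, not consumed, so every block is found.
        System.out.println(collect("(A\\d)(?=(A))", input)); // prints [A1A, A2A, A3A]
    }
}
```

The second group has to be appended by hand because text matched inside a lookahead is not part of the overall match.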
Regex you can use:
([^\n]*\Q{\b\f1\fs24\E[^\n]*\n.*?)(?=([^\n]*\Q{\b\f1\fs24\E))
PS: Note the use of Pattern.quote in the code to avoid excessive escaping; it wraps the tag in \Q...\E, as in the regex above.
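To illustrate the note about Pattern.quote, here is a small sketch comparing it with a hand-escaped version of the same tag. The class name QuoteDemo and the helper finds are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: true if the regex matches somewhere in the line.
    static boolean finds(String regex, String line) {
        return Pattern.compile(regex).matcher(line).find();
    }

    public static void main(String[] args) {
        String tag = "{\\b\\f1\\fs24";             // the literal text {\b\f1\fs24
        String quoted = Pattern.quote(tag);        // wraps the literal in \Q ... \E
        String manual = "\\{\\\\b\\\\f1\\\\fs24";  // hand-escaped equivalent

        String line = "\\par }{\\b\\f1\\fs24";     // the text \par }{\b\f1\fs24

        System.out.println(quoted);                // prints \Q{\b\f1\fs24\E
        System.out.println(finds(quoted, line));   // prints true
        System.out.println(finds(manual, line));   // prints true
    }
}
```

Both patterns match the same text; the quoted form simply keeps the tag readable when it is concatenated into a larger regex.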
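Code:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;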
String text = "\\par }{\\b\\f1\\fs24\\ul\\insrsid14762702 Data}{\\insrsid14762702 \\tab \\tab }{\\b\\f1\\fs24\n\\par Start:\\tab 2015-01-14 10:56:25\n\\par Duration:\\tab 22:40:23\n\\par Positions:\\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm\n\\par Sensor Position(s):\\tab -10.0, 5.0 cm\n\\par Depth:\\tab N/A\n\\par \n\\par }{\\b\\f1\\fs24\n\\par }{\\b\\f1\\fs24\\ul\\insrsid14762702 History}{\\insrsid14762702 \n\\par Other \n{\\b\\f1\\fs24\\ul\\insrsid14762702";
String tag = Pattern.quote("{\\b\\f1\\fs24");
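Pattern p = Pattern.compile("([^\n]*" + tag + "[^\n]*\n.*?)(?=([^\n]*" + tag + "))",
        Pattern.DOTALL);
Matcher m = p.matcher(text);
List<String> matches = new ArrayList<>();
while (m.find()) {
    // group 1: the start-tag line plus everything up to the end-tag line;
    // group 2: the end-tag line up to and including the tag (from the lookahead)
    matches.add(m.group(1) + m.group(2));
}
for (String s : matches)
    System.err.println(s + "\n");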
Output:
\par }{\b\f1\fs24\ul\insrsid14762702 Data}{\insrsid14762702 \tab \tab }{\b\f1\fs24
\par Start:\tab 2015-01-14 10:56:25
\par Duration:\tab 22:40:23
\par Positions:\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm
\par Sensor Position(s):\tab -10.0, 5.0 cm
\par Depth:\tab N/A
\par
\par }{\b\f1\fs24
\par }{\b\f1\fs24
\par }{\b\f1\fs24
\par }{\b\f1\fs24\ul\insrsid14762702 History}{\insrsid14762702
\par Other
{\b\f1\fs24
Upvotes: 1