Sebastian Zeki

Reputation: 6874

RegEx to span two lines, matching only on separate line

I have an .rtf file that contains many bold titles, and I am trying to capture the data between two consecutive bold titles. However, the tag that marks text as bold is exactly the same at both ends of the text.

So I am trying to find a pattern that captures everything from one bold tag up to the nearest bold tag on a following line, rather than stopping at a bold tag on the same line. I am using Java.

Example text:

\par }{\b\f1\fs24\ul\insrsid14762702 Data}{\insrsid14762702 \tab \tab }{\b\f1\fs24
\par Start:\tab 2015-01-14 10:56:25
\par Duration:\tab 22:40:23
\par Positions:\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm
\par Sensor Position(s):\tab -10.0, 5.0 cm
\par Depth:\tab N/A
\par 
\par }{\b\f1\fs24
\par }{\b\f1\fs24\ul\insrsid14762702 History}{\insrsid14762702 
\par Other 
{\b\f1\fs24\ul\insrsid14762702

What I am currently using:

((\\\\b\\\\f1\\\\fs24.+?\\{\\\\b\\\\f1\\\\fs24))

The whole Java line is:

Pattern pattern = Pattern.compile("((\\\\b\\\\f1\\\\fs24.+?\\{\\\\b\\\\f1\\\\fs24))",Pattern.DOTALL);

Which is giving me:

\par }{\b\f1\fs24\ul\insrsid14762702 Data}{\insrsid14762702 \tab \tab }{\b\f1\fs24

\par }{\b\f1\fs24
    \par }{\b\f1\fs24

{\b\f1\fs24\ul\insrsid14762702 History}{\insrsid14762702 
    \par Other 
    {\b\f1\fs24

The expected output is:

\par }{\b\f1\fs24\ul\insrsid14762702 Data}{\insrsid14762702 \tab \tab }{\b\f1\fs24
    \par Start:\tab 2015-01-14 10:56:25
    \par Duration:\tab 22:40:23
    \par Positions:\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm
    \par Sensor Position(s):\tab -10.0, 5.0 cm
    \par Depth:\tab N/A
    \par 
    \par }{\b\f1\fs24

And:

 \par }{\b\f1\fs24
    \par }{\b\f1\fs24

And:

\par }{\b\f1\fs24\ul\insrsid14762702 History}{\insrsid14762702 
    \par Other 
    {\b\f1\fs24\ul\insrsid14762702

Upvotes: 1

Views: 114

Answers (2)

Riz

Reputation: 1065

You need a multiline regex, like the one below:

 String text = "\\par }{\\b\\f1\\fs24\\ul\\insrsid14762702 Data}{\\insrsid14762702 \\tab \\tab }{\\b\\f1\\fs24\n" +
"\\par Start:\\tab 2015-01-14 10:56:25\n" +
"\\par Duration:\\tab 22:40:23\n" +
"\\par Positions:\\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm\n" +
"\\par Sensor Position(s):\\tab -10.0, 5.0 cm\n" +
"\\par Depth:\\tab N/A\n" +
"\\par \n" +
"\\par }{\\b\\f1\\fs24\n" +
"\\par }{\\b\\f1\\fs24\\ul\\insrsid14762702 History}{\\insrsid14762702 \n" +
"\\par Other \n" +
"{\\b\\f1\\fs24\\ul\\insrsid14762702";

Pattern pattern = Pattern.compile("(?mi)\\\\b(?<content>.*)\\\\b");
Matcher matcher =  pattern.matcher(text);

while(matcher.find()){
  String content = matcher.group("content");
  System.out.println("content: "+ content);
}

Upvotes: 1

anubhava

Reputation: 785058

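You can use two capturing groups for this: one for the starting tag and the text up to the ending tag (which must not be on the same line), and a second one for the ending tag itself. Because each ending tag is also the starting tag of the next match, the matches overlap, so the second group has to sit inside a lookahead, which matches the ending tag without consuming it.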

Regex you can use:

([^\n]*\Q{\b\f1\fs24\E[^\n]*\n.*?)(?=([^\n]*\Q{\b\f1\fs24\E))


PS: Note the use of Pattern.quote in the code below to avoid excessive escaping.
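To illustrate the point about Pattern.quote, here is a minimal sketch comparing it with hand-written escaping for the tag from the question. The class name QuoteDemo and its helper methods are made up for this illustration; only Pattern.quote and Pattern.matches are real API.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helpers are illustrative only.
class QuoteDemo {
    // The literal RTF tag from the question, written as a Java string.
    static final String TAG = "{\\b\\f1\\fs24";

    // Pattern.quote wraps the text in \Q...\E so every character is literal.
    static String quotedTag() {
        return Pattern.quote(TAG);
    }

    // Hand-escaped equivalent: each regex metacharacter needs its own
    // backslash, and each backslash is doubled again in the Java literal.
    static String manualTag() {
        return "\\{\\\\b\\\\f1\\\\fs24";
    }

    public static void main(String[] args) {
        System.out.println(quotedTag());
        // Both forms should match the literal tag text.
        System.out.println(Pattern.matches(quotedTag(), TAG));
        System.out.println(Pattern.matches(manualTag(), TAG));
    }
}
```

The quoted form stays readable because the tag appears exactly as it does in the file, apart from the usual Java string escaping.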

Code:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String text = "\\par }{\\b\\f1\\fs24\\ul\\insrsid14762702 Data}{\\insrsid14762702 \\tab \\tab }{\\b\\f1\\fs24\n"
        + "\\par Start:\\tab 2015-01-14 10:56:25\n"
        + "\\par Duration:\\tab 22:40:23\n"
        + "\\par Positions:\\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm\n"
        + "\\par Sensor Position(s):\\tab -10.0, 5.0 cm\n"
        + "\\par Depth:\\tab N/A\n"
        + "\\par \n"
        + "\\par }{\\b\\f1\\fs24\n"
        + "\\par }{\\b\\f1\\fs24\\ul\\insrsid14762702 History}{\\insrsid14762702 \n"
        + "\\par Other \n"
        + "{\\b\\f1\\fs24\\ul\\insrsid14762702";
String tag = Pattern.quote("{\\b\\f1\\fs24"); // match the tag literally

// Group 1: the line with the opening tag and the text that follows it.
// Group 2 (in the lookahead): the closing tag, matched but not consumed.
Pattern p = Pattern.compile( "([^\n]*" + tag + "[^\n]*\n.*?)(?=([^\n]*" + tag + "))",
            Pattern.DOTALL );
Matcher m = p.matcher( text );
List<String> matches = new ArrayList<>();
while(m.find()) {
    matches.add(m.group(1) + m.group(2)); // re-attach the closing tag
}
for (String s: matches)
    System.err.println(s + "\n");

Output:

\par }{\b\f1\fs24\ul\insrsid14762702 Data}{\insrsid14762702 \tab \tab }{\b\f1\fs24
\par Start:\tab 2015-01-14 10:56:25
\par Duration:\tab 22:40:23
\par Positions:\tab 3.0, 5.0, 7.0, 9.0, 15.0, 17.0 cm
\par Sensor Position(s):\tab -10.0, 5.0 cm
\par Depth:\tab N/A
\par 
\par }{\b\f1\fs24

\par }{\b\f1\fs24
\par }{\b\f1\fs24

\par }{\b\f1\fs24\ul\insrsid14762702 History}{\insrsid14762702 
\par Other 
{\b\f1\fs24
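
The lookahead idea is easier to see in isolation on a toy input. The sketch below is only an illustration under assumed names: the OverlappingMatches class, its two helpers, and the "<b>" delimiter are all made up, and the pattern is a simplified version of the one above without the line-based parts. It contrasts a pattern that consumes the closing delimiter with one that only peeks at it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and the "<b>" delimiter are illustrative.
class OverlappingMatches {

    // Consumes the closing delimiter, so it cannot also open the next match.
    static List<String> consuming(String text, String tag) {
        String q = Pattern.quote(tag);
        Matcher m = Pattern.compile(q + ".*?" + q, Pattern.DOTALL).matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    // Captures the closing delimiter inside a lookahead, so the next search
    // starts at that delimiter and adjacent spans can share it.
    static List<String> withLookahead(String text, String tag) {
        String q = Pattern.quote(tag);
        Pattern p = Pattern.compile("(" + q + ".*?)(?=(" + q + "))", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            // Group 2 was only peeked at, so append it to rebuild the full span.
            out.add(m.group(1) + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "<b>one<b>two<b>";
        System.out.println(consuming(text, "<b>"));     // one span only
        System.out.println(withLookahead(text, "<b>")); // both spans
    }
}
```

With the consuming pattern, the middle delimiter is used up as the end of the first span, so the second span is never found. The lookahead version leaves it in place for the next call to find().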

Upvotes: 1
