Sietse

Reputation: 250

Java regex filter headers

I'm trying to filter headings from a big document.

Like this:

5.1.8 Reports

5 technische en applicatiearchitectuur

this version number 5.5.5 (or 5.5) should stay in the text but the 2 sentences above should be removed

The problem is that I don't want to remove version numbers and similar values. I tried (\d.), but is there a way to write a regex that removes only the headings and leaves the version numbers in the text?

Upvotes: 2

Views: 270

Answers (1)

Wiktor Stribiżew

Reputation: 627327

You can use

(?m)^(\d+(?:\.\d+)*\.?)\h+.*

Replace with the $1 backreference, which keeps the heading number captured in Group 1 and drops the heading text after it.

In Java:

String result = s.replaceAll("(?m)^(\\d+(?:\\.\\d+)*\\.?)\\h+.*", "$1");

Details

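  • (?m)^ - start of a line ((?m) enables multiline mode, so ^ matches at the start of every line, not only at the start of the string)
  • (\d+(?:\.\d+)*\.?) - Group 1:
    • \d+ - 1 or more digits
    • (?:\.\d+)* - 0 or more sequences of a . followed by 1 or more digits
    • \.? - an optional dot
  • \h+ - 1 or more horizontal whitespace characters
  • .* - the rest of the line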
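
To see why the ^ anchor matters, here is a minimal sketch that applies the same pattern to a heading and to a body line that contains a version number. The class name HeadingTextStripper and the method name stripHeadingText are made up for this illustration; \h requires Java 8 or later.

```java
import java.util.regex.Pattern;

class HeadingTextStripper {
    // Same pattern as above, compiled once; (?m) lets ^ match at every line start.
    private static final Pattern HEADING =
            Pattern.compile("(?m)^(\\d+(?:\\.\\d+)*\\.?)\\h+.*");

    // Keeps the leading number (Group 1) and drops the rest of each matching line.
    static String stripHeadingText(String text) {
        return HEADING.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // Line starts with a number followed by whitespace: the text after it is removed.
        System.out.println(stripHeadingText("5.1.8 Reports"));
        // The number is not at the start of the line: nothing matches, line is unchanged.
        System.out.println(stripHeadingText("see version 5.5.5 here"));
    }
}
```

The first call prints 5.1.8 and the second prints the input unchanged, because the pattern can only start matching at the beginning of a line.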

Java demo:

String s = "5.1.8 Reports\n\n5 technische en applicatiearchitectuur\n\nthis version number 5.5.5 (or 5.5) should stay in the text but the 2 sentences above should be removed";
String result = s.replaceAll("(?m)^(\\d+(?:\\.\\d+)*\\.?)\\h+.*", "$1");
System.out.println(result);

Result

5.1.8

5

this version number 5.5.5 (or 5.5) should stay in the text but the 2 sentences above should be removed
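
If the heading lines should disappear completely, number included, a variant without the capture group and with an empty replacement may be closer to what the question asks for. This is only a sketch under that assumption, not part of the answer above: the class name HeadingLineRemover, the method name removeHeadingLines, and the trailing \R? (an optional line break, Java 8+) are additions made for illustration.

```java
import java.util.regex.Pattern;

class HeadingLineRemover {
    // Hypothetical variant of the pattern above: no capture group, plus an
    // optional trailing line break (\R?) so the whole heading line is consumed.
    private static final Pattern HEADING_LINE =
            Pattern.compile("(?m)^\\d+(?:\\.\\d+)*\\.?\\h+.*\\R?");

    // Removes every line that starts with a dotted number followed by whitespace.
    static String removeHeadingLines(String text) {
        return HEADING_LINE.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "5.1.8 Reports\n\n5 technische en applicatiearchitectuur\n\n"
                + "this version number 5.5.5 (or 5.5) should stay in the text";
        // The blank lines between the headings are not matched and remain.
        System.out.println(removeHeadingLines(s));
    }
}
```

Both this variant and the original pattern treat any line that begins with a number followed by whitespace as a heading, so a body line such as "2 sentences above ..." would be affected as well. If the document contains such lines, the pattern would need a stricter rule for what counts as a heading.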

Upvotes: 2
