emsimpson92

Reputation: 1778

Matching multiple heading styles using regex

I'm trying to use regex to capture section headings. The regex below captures "4.1 General", but if I add a newline to the end of it, \n([\d\.]+ ?\w+)\n, it no longer captures that line. Why? Is the line not followed by a newline, or am I missing something?

Here's my example for reference

\n([\d\.]+ ?\w+)

Input

3.6.10
POLLUTION DEGREE 4
continuous conductivity occurs due to conductive dust, rain or other wet conditions
3.6.11
CLEARANCE
shortest distance in air between two conductive parts
3.6.12
CREEPAGE DISTANCE
shortest distance along the surface of a solid insulating material between two conductive
parts
4 Tests
4.1 General
Tests in this standard are TYPE TESTS to be carried out on samples of equipment or parts.

\n([\d\.]+ ?\w+)\n? doesn't seem to work either.

Upvotes: 0

Views: 58

Answers (2)

Sten Petrov

Reputation: 11040

Have you considered that the line break may not be a single character?

\n([0-9\.]+ ?\w+)(\n|\r)

In Expresso, the regex above finds 4 matches in your sample; the last one is

[LF]4.1 General[CR]

where [LF] is \n and [CR] is \r.

Keep in mind that [CR], [LF] and [CRLF] are all possible end-of-line sequences.
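
To illustrate why the line-ending characters matter, here is a minimal sketch that runs the regex above on the tail of the sample, once with CRLF and once with LF-only line endings. It uses Python's re module rather than .NET or Expresso, on the assumption that the two engines treat this particular pattern the same way. The line endings are assumptions, and the helper name `headings` is made up for the illustration.

```python
import re

# The regex from this answer, unchanged.
pattern = re.compile(r"\n([0-9\.]+ ?\w+)(\n|\r)")

def headings(text):
    # Hypothetical helper: collect group 1 of every non-overlapping match.
    return [m.group(1) for m in pattern.finditer(text)]

# Tail of the sample input from the question; the line endings are assumed.
lines = ["parts", "4 Tests", "4.1 General", "Tests in this standard are TYPE TESTS"]
crlf_text = "\r\n".join(lines) + "\r\n"
lf_text = "\n".join(lines) + "\n"

crlf_headings = headings(crlf_text)
lf_headings = headings(lf_text)

print(crlf_headings)  # ['4 Tests', '4.1 General']
print(lf_headings)    # ['4 Tests']
```

With CRLF input the group (\n|\r) consumes only the \r, so the \n that follows is still available to start the next match. With LF-only input the same regex skips 4.1 General.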

Upvotes: 0

Wiktor Stribiżew

Reputation: 627607

This is a classic case of overlapping matches. The previous match is \n4 Tests\n, so its final \n is already consumed and cannot serve as the leading \n of the next match.
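
A minimal sketch of that consumption, written with Python's re module as a stand-in for .NET (an assumption; both engines return non-overlapping matches in the same way for these patterns). The input is the tail of the sample from the question, assumed to use LF-only line endings.

```python
import re

# Tail of the sample input from the question, assuming LF-only line endings.
text = (
    "parts\n"
    "4 Tests\n"
    "4.1 General\n"
    "Tests in this standard are TYPE TESTS.\n"
)

# With a trailing \n, each match consumes the newline that ends the heading.
consuming = re.compile(r"\n([\d.]+ ?\w+)\n")

# Without the trailing \n, that newline stays available for the next match.
non_consuming = re.compile(r"\n([\d.]+ ?\w+)")

with_trailing_newline = consuming.findall(text)
without_trailing_newline = non_consuming.findall(text)

print(with_trailing_newline)     # ['4 Tests']
print(without_trailing_newline)  # ['4 Tests', '4.1 General']
```

After \n4 Tests\n has matched, the search resumes at the 4 of 4.1 General, where there is no preceding \n left to match.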

Since you want to match whole lines, it makes more sense to use the ^ and $ anchors with the RegexOptions.Multiline option (written below as the inline (?m) flag):

@"(?m)^([\d.]+ ?\w+)\r?$"

See the .NET regex online demo

Note that in a .NET regex, $ matches only before \n (or at the end of the string), not before \r. Since Windows line endings are CRLF, an optional CR, \r?, is needed before $.
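
As a rough check of the anchored pattern, here is a sketch in Python's re module, whose multiline $ likewise matches before \n but not before \r (using Python here is an assumption made for illustration, not the .NET demo itself). The input is the tail of the sample, with line endings assumed for each case.

```python
import re

lines = ["parts", "4 Tests", "4.1 General", "Tests in this standard are TYPE TESTS."]
crlf_text = "\r\n".join(lines) + "\r\n"  # assumed Windows line endings
lf_text = "\n".join(lines) + "\n"        # assumed Unix line endings

# Anchored pattern from this answer: (?m) makes ^ and $ match at line boundaries.
anchored = re.compile(r"(?m)^([\d.]+ ?\w+)\r?$")

# Same pattern without the optional CR, for comparison.
anchored_no_cr = re.compile(r"(?m)^([\d.]+ ?\w+)$")

crlf_result = anchored.findall(crlf_text)
lf_result = anchored.findall(lf_text)
crlf_result_no_cr = anchored_no_cr.findall(crlf_text)

print(crlf_result)        # ['4 Tests', '4.1 General']
print(lf_result)          # ['4 Tests', '4.1 General']
print(crlf_result_no_cr)  # []
```

Because ^ and $ are zero-width, no newline is consumed, so consecutive heading lines all match. The \r sits outside the capture group, so the captured text stays clean.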

Upvotes: 2
