oscilatingcretin

Reputation: 10959

What's causing this regex to match everything?

I am trying to use this regex:

^(\s+)<ProjectReference(.|\s)+?(Project2)</Name>(.|\s)+?</ProjectReference>

...to locate only this section:

    <ProjectReference Include="..\..\Project2\Project2.csproj">
      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
      <Name>Project2</Name>
    </ProjectReference>

...in the below document:

what is causing this text up here to be selected??

    <ProjectReference Include="..\..\Project1\Project1\Project1.csproj">
      <Project>{714c6b26-c609-40a4-80a9-421bd842562d}</Project>
      <Name>Project1</Name>
    </ProjectReference>


  <ItemGroup>
    <ProjectReference Include="..\..\Project2\Project2.csproj">
      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
      <Name>Project2</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project3\Project3\Project3.csproj">
      <Project>{39860208-8146-429f-a1d1-5f8ed2fd7f5f}</Project>
      <Name>Project3</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project4\Project4.csproj">
      <Project>{58144d60-19d9-4d11-8ae6-088e03ccf874}</Project>
      <Name>Project4</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project5\Project5.csproj">
      <Project>{33baa509-ad24-4a72-a2fc-8f297e75e90d}</Project>
      <Name>Project5</Name>
    </ProjectReference>
  </ItemGroup>
  <PropertyGroup>
    <VisualStudioVersion Condition="'$(VisualStudioVersion)' == ''">10.0</VisualStudioVersion>
    <VSToolsPath Condition="'$(VSToolsPath)' == ''">$(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)</VSToolsPath>
  </PropertyGroup>

In Notepad++, it appears to initially locate the match, but then it proceeds to match the entire document in a second match (so it's finding 2 matches total). I originally discovered this in my .NET app when my utility was replacing the entire contents of my project file with an empty string, effectively clearing the entire thing out.

I've spent over an hour toiling over this, so let's see if SE can figure it out.

Update: Though I've marked an answer that actually works, I ended up going with a not-so-magical approach to ensure that no rare regex quirks creep into my code later down the road as was the case recently.

^(\s+)<ProjectReference.+?({0})\.(csproj|vbproj).*\r\n.*\r\n\s+<Name>{0}</Name>\r\n\s*</ProjectReference>

...where {0} is the name of my project. While more verbose, this solution is less likely to bug out with excessive matches. I use RegexOptions.Multiline in my .NET app so that I can anchor to the beginning of a line.
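A rough sketch of how the {0} placeholder might be filled in, written in Python rather than .NET (an assumption made purely for illustration; `build_pattern` and the sample text are hypothetical). Escaping the project name first keeps characters such as "." in a name from being read as regex syntax; in .NET, `Regex.Escape` plays the same role.

```python
import re

def build_pattern(project_name):
    # Hypothetical helper: escape the name so "." or "+" in it stay literal.
    name = re.escape(project_name)
    return re.compile(
        r"^(\s+)<ProjectReference.+?(" + name + r")\.(csproj|vbproj).*\r\n"
        r".*\r\n"
        r"\s+<Name>" + name + r"</Name>\r\n"
        r"\s*</ProjectReference>",
        re.MULTILINE,  # lets ^ anchor at the start of each line
    )

# Sample with Windows line endings, since the pattern spells out \r\n.
doc = "\r\n".join([
    "  <ItemGroup>",
    r'    <ProjectReference Include="..\..\Project2\Project2.csproj">',
    "      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>",
    "      <Name>Project2</Name>",
    "    </ProjectReference>",
    r'    <ProjectReference Include="..\..\Project3\Project3\Project3.csproj">',
    "      <Project>{39860208-8146-429f-a1d1-5f8ed2fd7f5f}</Project>",
    "      <Name>Project3</Name>",
    "    </ProjectReference>",
    "  </ItemGroup>",
])

match = build_pattern("Project2").search(doc)
print(match.group(0))
```

Because every line of the element is spelled out, the pattern cannot wander into a neighbouring element; the trade-off is that it breaks if the element gains a line or the file uses bare \n line endings.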

Upvotes: 3

Views: 183

Answers (2)

Alan Moore

Reputation: 75242

First, never use (.|\s) to match everything-including-newlines; it's a freeze-up waiting to happen (see this answer for more info). The search dialog in Notepad++ includes a check box for that purpose, labelled . matches newline.
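Outside Notepad++, most regex flavours expose the same switch as a flag. A minimal sketch in Python, chosen only for illustration (the question's own app is .NET, where the corresponding option is RegexOptions.Singleline):

```python
import re

text = "<Name>Project2</Name>\n    </ProjectReference>"

# Without DOTALL, "." stops at the newline, so the pattern cannot span two lines.
no_flag = re.search(r"<Name>.*?</ProjectReference>", text)

# With DOTALL, "." also matches "\n"; no (.|\s) alternation is needed.
with_flag = re.search(r"<Name>.*?</ProjectReference>", text, re.DOTALL)

print(no_flag)                     # None
print(with_flag.group(0) == text)  # True
```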

Second, you should not be getting that result, no matter what. I've reproduced it in a local copy of Notepad++, and it looks like a bug. Maybe the regex is freezing, and NPP is failing to catch the error. At any rate, you should be getting only one match, and that's what happens when I select . matches newline and change your regex to this:

^\h*<ProjectReference.*?Project2</Name>.*?</ProjectReference>

However, it still matches too much, encompassing both the Project1 and Project2 elements. That's because non-greedy quantifiers only affect where matching ends, not where it begins. You need to use something more specific to make sure the match doesn't extend beyond the element where it started. I think this is the surest way to do that:

^\h*<ProjectReference(?:(?!</?ProjectReference).)*Project2</Name>.*?</ProjectReference>

The idea is that the dot is allowed to match any character (including newlines), unless it's the first character of the sequence <ProjectReference or </ProjectReference. So, once it starts matching the opening <ProjectReference> tag, it can match anything except another such tag (opening or closing), until it finds the sentinel string (Project2).
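The difference between the two patterns can be checked with a short script. This is a sketch in Python, not Notepad++: Python's re module does not support \h, so [ \t] is substituted, and the MULTILINE and DOTALL flags stand in for the ^ line anchoring and the ". matches newline" check box. The sample text is a trimmed copy of the question's document.

```python
import re

doc = r"""    <ProjectReference Include="..\..\Project1\Project1\Project1.csproj">
      <Name>Project1</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project2\Project2.csproj">
      <Name>Project2</Name>
    </ProjectReference>
    <ProjectReference Include="..\..\Project3\Project3\Project3.csproj">
      <Name>Project3</Name>
    </ProjectReference>
"""

flags = re.MULTILINE | re.DOTALL

# Lazy quantifiers only: the match starts at the first opening tag it can.
lazy = re.search(
    r"^[ \t]*<ProjectReference.*?Project2</Name>.*?</ProjectReference>",
    doc, flags)

# Tempered dot: refuses to step over another opening or closing tag.
tempered = re.search(
    r"^[ \t]*<ProjectReference(?:(?!</?ProjectReference).)*"
    r"Project2</Name>.*?</ProjectReference>",
    doc, flags)

print(lazy.group(0).count("<ProjectReference "))      # 2
print(tempered.group(0).count("<ProjectReference "))  # 1
```

The lazy version swallows the Project1 element as well, while the tempered version fails at the first opening tag and succeeds only when it starts at the Project2 element.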

UPDATE: This is definitely a bug in Notepad++. I've done some more testing myself, and found other reports to confirm it (here and here). Those guys get pretty creative in their attempts to trigger the bug, but it boils down to this: if the regex takes too long to match, NPP incorrectly selects everything.

Upvotes: 2

Federico Piazza

Reputation: 31035

I think the best approach would be to use an XPath expression or an XML parser.

However, as you stated in your comment, if you want to capture that specific portion using regex, then you can use this:
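As a rough illustration of the parser route, here is a sketch using Python's standard xml.etree.ElementTree (an assumption; the question's app is .NET, which has its own XML APIs). The fragment is a cut-down copy of the question's ItemGroup. A full project file may declare a default XML namespace, in which case the path would need a namespace prefix.

```python
import xml.etree.ElementTree as ET

fragment = r"""<ItemGroup>
  <ProjectReference Include="..\..\Project2\Project2.csproj">
    <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
    <Name>Project2</Name>
  </ProjectReference>
  <ProjectReference Include="..\..\Project3\Project3\Project3.csproj">
    <Project>{39860208-8146-429f-a1d1-5f8ed2fd7f5f}</Project>
    <Name>Project3</Name>
  </ProjectReference>
</ItemGroup>"""

root = ET.fromstring(fragment)

# ElementTree supports a limited XPath subset, including [tag='text'] predicates.
matches = root.findall("ProjectReference[Name='Project2']")

for ref in matches:
    print(ref.get("Include"))

# Removing the element is then a tree operation, not a text replacement.
for ref in matches:
    root.remove(ref)

remaining = [name.text for name in root.iter("Name")]
print(remaining)
```

Selecting by element structure avoids the whole question of where a text match begins and ends.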

(<ProjectReference.*?Project2[\s\S]*?</ProjectReference>)

Working demo

Match information

MATCH 1
1.  [209-384]   `<ProjectReference Include="..\..\Project2\Project2.csproj">
      <Project>{6c2a7631-8b47-4ae9-a68f-f728666105b9}</Project>
      <Name>Project2</Name>
    </ProjectReference>`

Besides regex101, I also used SublimeText to show that it works. Notepad++, however, has a poor regex engine and usually trips over constructs like [\s\S]*?.

On the other hand, as for your question about why it is failing: your regex is not failing, but your pattern allows that greedy match (even with the lazy quantifier) because of your (.|\s) alternation:

^(\s+)<ProjectReference(.|\s)+?(Project2)</Name>(.|\s)+?</ProjectReference>
                          ^--- HERE

If you check the Regex101 explanation, you can see:

2nd Capturing group (.|\s)+?
  Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
  Note: A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
  1st Alternative: .
    . matches any character (except newline)
  2nd Alternative: \s
    \s match any white space character [\r\n\t\f ]

Upvotes: 3
