Grimson

Reputation: 572

Removing comments using regex

I am building a parser, and I would like to remove comments from various lines. For example,

variable = "some//thing" ////actual comment

The comment marker is //. In this case, variable should contain "some//thing" and everything after it should be ignored. I plan to do this with a regex replace. Currently I am using (".*"|[ \t])*(\/\/.*) as the regex. However, the replacement removes "some//thing" ////actual comment entirely.

I cannot figure out which regex I should use instead. Thanks for any help.

Additional info - I am using C# with netcoreapp 1.1.0

Edit - some lines might contain only a comment, like //line comment. Strings might also contain escaped quotes.

Upvotes: 1

Views: 3029

Answers (2)

Chindraba

Reputation: 870

Here is the ugly regex pattern. I believe it will work well. I have tried it with every pathological example I can think of, including lines that contain syntax errors. For example, a quoted string that has too many quotes, or too few, or has a double escaped quote, which is, therefore, not escaped. And with quoted strings in the comments, which I have been known to do when I want to remind myself of alternatives.

The only time that it trips up is if there is a double slash inside a seemingly quoted string and somehow that string is malformed and the double slash ends up legally outside the properly quoted portion. Syntactically that makes it a valid comment, even though not the programmer's intention. So, from the programmer's perspective it's wrong, but by the rules, it's really a comment. Meaning, the pattern only appears to trip up.

When used, the pattern will return the non-comment portion of the line(s). The pattern has a newline \n in it to allow for applying it to an entire file. You may need to modify that if your system interprets newlines in some other fashion, for example as \r or \r\n. To use it in single-line mode you can remove that if you choose. It is at characters 17 and 18 in the one-liner, and in the multi-line version it is on the fifth line, at the 6th and 7th printing characters. You can safely leave it there, however, as in single-line mode it makes no difference, and in multi-line mode it will return a newline for lines of code that are either blank or have a comment beginning in the first column. That will keep the line numbers the same in the original version and the stripped version if you write the results to a new file, which makes comparison easy.

One major caveat for this pattern: it uses a grouping construct that has varying levels of support in regex engines. I believe that as used here, with a lookaround, only the .NET and PCRE engines will accept it. YMMV. It is a ternary type: (?(_condition_)_then_|_else_). The _condition_ pattern is treated as a zero-width assertion. If it matches, the _then_ pattern is used in the attempted match; otherwise the _else_ pattern is used. Without that construct, the pattern was growing to uncommon lengths and was still failing on some of my pathological test cases.
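For illustration, here is a minimal C# sketch of the conditional construct on its own. The toy pattern, the class name ConditionalDemo, and the method IsToken are invented for this example and are not part of the pattern in this answer: if the next character is a double quote, the input must be a complete quoted string, otherwise it must be a bare word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConditionalDemo
{
    // Regex as the engine sees it:  ^(?(")"[^"]*"|\w+)$
    //   condition: "        tested as a zero-width assertion
    //   then:      "[^"]*"  a simple quoted string (no escape handling in this toy)
    //   else:      \w+      a bare word
    private static readonly Regex Token = new Regex(@"^(?("")""[^""]*""|\w+)$");

    public static bool IsToken(string input)
    {
        return Token.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsToken("\"hello world\"")); // starts with a quote: then-branch is tried
        Console.WriteLine(IsToken("hello"));           // no quote: else-branch is tried
        Console.WriteLine(IsToken("hello world"));     // no quote, and not a single bare word
    }
}
```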

The pattern presented here is as it needs to be seen by the regex engine. I am not a C# programmer, so I don't know all the nuances of escaping quoted strings. Getting this pattern into your code, such that all the backslashes and quotes are seen properly by the regex engine is still up to you. Maybe C# has the equivalent of Perl's heredoc syntax.

This is the one-liner pattern to use:

^((?:(?:(?:[^"'/\n]|/(?!/))*)(?("(?=(?:\\\\|\\"|[^"])*"))(?:"(?:\\\\|\\"|[^"])*")|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')|(?(/)|.))))*)
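On the C# side, a verbatim string literal (@"...") is probably the closest thing to a heredoc: backslashes are taken literally and only the double quote has to be doubled. A minimal sketch, assuming the pattern is applied one line at a time; the class name CommentStripper and the method StripComment are invented for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentStripper
{
    // The one-liner pattern, split over several verbatim strings for readability.
    // Inside @"..." every double quote of the pattern is written as "".
    private const string Pattern =
        @"^((?:(?:(?:[^""'/\n]|/(?!/))*)" +
        @"(?(""(?=(?:\\\\|\\""|[^""])*""))(?:""(?:\\\\|\\""|[^""])*"")" +
        @"|(?('(?=(?:\\\\|\\'|[^'])*'))(?:'(?:\\\\|\\'|[^'])*')" +
        @"|(?(/)|.))))*)";

    private static readonly Regex CodePart = new Regex(Pattern);

    // Returns the non-comment portion of a single line (capture group 1).
    public static string StripComment(string line)
    {
        Match m = CodePart.Match(line);
        return m.Success ? m.Groups[1].Value : line;
    }

    public static void Main()
    {
        Console.WriteLine(StripComment("variable = \"some//thing\" ////actual comment"));
        Console.WriteLine(StripComment("//line comment"));
    }
}
```

Group 1 keeps any whitespace that precedes the comment marker, so callers may want to call TrimEnd() on the result.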

If you want to use the ignore pattern whitespace option, you can use this version:

(?x) # Turn on the ignore white space option
^( # Start the only capturing group
    (?: # A non-capturing group to allow for repeating the logic
        (?: # Capture either of the two options below
            [^"'/\n] # Capture everything not a single quote, double quote, a slash, or a newline
            | # OR
            /(?!/) # Capture a slash not followed by a slash [a slash, then a negative look-ahead for a slash]
        )* # As many times as possible, even if none
        (?(" # Start a conditional match for double-quoted strings
                (?=(?:\\\\|\\"|[^"])*") # Followed by a properly closed double-quoted string
            ) # Then
            (?:"(?:\\\\|\\"|[^"])*") # Capture the whole double-quoted string
            | # Otherwise
            (?(' # Start a conditional match for single-quoted strings
                (?=(?:\\\\|\\'|[^'])*') # Followed by a properly closed single-quoted string
                ) # Then
                (?:'(?:\\\\|\\'|[^'])*') # Capture the whole single-quoted string
                | # Otherwise
                (?([^/]) # If next character is not a slash
                .) # Capture that character; it is either a single quote or a double quote that is not part of a properly closed string
            ) # end the conditional match for single-quoted strings
        ) # End the conditional match for double-quoted strings
    )* # Close the repeating non-capturing group, capturing as many times as possible, even if none
) # Close the only capturing group

This allows for your code to explain this monstrosity so that when someone else looks at it, or in a few months you have to work on it yourself, there's no WTF moment. I think the comments explain it well, but feel free to change them any way you please.

As mentioned above, the conditional match grouping has limited support. One place it will fail is on the site you linked to in an earlier comment. Since you're using C#, I chose to do my testing in the .NET Regex Tester, which can handle those constructs. It includes a nice Reference too. Given the proper selections on the side, you can test either version above and experiment with it as well. Considering its complexity, I would recommend testing it, somewhere, against data from your files, as well as any edge cases and pathological tests you can dream up.

Just to redeem this small pattern, there is a much bigger pattern for testing email addresses that is 78 columns by 81 lines, with a couple dozen characters to spare. (I do not recommend using it, or any other regex, for testing email addresses. Wrong tool for the job.) If you want to scare yourself, have a peek at it on the ex-parrot site. I had nothing to do with that!!

Upvotes: 2

Whothehellisthat

Reputation: 2152

"[^"\\]*(?:\\[\W\w][^"\\]*)*"|(\/\/.*)

Flags: global

Matches full strings or a comment.

Group 1: comment.

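So in the replacement callback, if group 1 did not participate, the match is a string: return the matched text unchanged. Otherwise the match is a comment, and you can do your thing with it, for example replace it with an empty string.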
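A minimal C# sketch of that replacement, assuming Regex.Replace with a MatchEvaluator callback (Regex.Replace already replaces every match, so no separate global flag is needed). The class name StringOrComment and the method RemoveComments are invented for this example:

```csharp
using System;
using System.Text.RegularExpressions;

public static class StringOrComment
{
    // Either a complete double-quoted string (with backslash escapes) or,
    // captured in group 1, a // comment running to the end of the line.
    private static readonly Regex Token =
        new Regex(@"""[^""\\]*(?:\\[\W\w][^""\\]*)*""|(\/\/.*)");

    public static string RemoveComments(string text)
    {
        // Group 1 succeeded: the match is a comment, so drop it.
        // Group 1 did not participate: the match is a string, so keep it as is.
        return Token.Replace(text, m => m.Groups[1].Success ? "" : m.Value);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveComments("variable = \"some//thing\" ////actual comment"));
        Console.WriteLine(RemoveComments("s = \"a\\\"//b\" // escaped quote inside the string"));
    }
}
```

Whitespace in front of the comment marker is left in place, so a TrimEnd() on each line may be wanted afterwards.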

Upvotes: 0
