Priyanka

Reputation: 107

Regex code for removing single and multi-line comments from C code

I have the following regular expression to remove multi-line comments, but I am having a hard time figuring out how to remove comments starting with //.

When I add (//.*) to the regular expression, it never seems to work.

 pattern = r"""
                        ##  --------- COMMENT ---------
       /\*              ##  Start of /* ... */ comment
       [^*]*\*+         ##  Non-* followed by 1-or-more *'s
       (                ##
         [^/*][^*]*\*+  ##
       )*               ##  0-or-more things which don't start with /
                        ##    but do end with '*'
       /                ##  End of /* ... */ comment
                        ##
        |               ## --------- COMMENT ---------
         (//.*)         ## Start of // comment
                        ##
     |                  ##  -OR-  various things which aren't comments:
       (                ##
                        ##  ------ " ... " STRING ------
         "              ##  Start of " ... " string
         (              ##
           \\.          ##  Escaped char
         |              ##  -OR-
           [^"\\]       ##  Non "\ characters
         )*             ##
         "              ##  End of " ... " string
       |                ##  -OR-
                        ##
                        ##  ------ ' ... ' STRING ------
         '              ##  Start of ' ... ' string
         (              ##
           \\.          ##  Escaped char
         |              ##  -OR-
           [^'\\]       ##  Non '\ characters
         )*             ##
         '              ##  End of ' ... ' string
       |                ##  -OR-
                        ##
                        ##  ------ ANYTHING ELSE -------
         .              ##  Any other char
         [^/"'\\]*      ##  Chars which don't start a comment, string
       )                ##    or escape

"""

Could someone please tell me where I am going wrong? I even tried the following regular expression:

//[^\r\n]*$

but that doesn't work either.

Upvotes: 2

Views: 519

Answers (1)

user557597

Try one of these...

They both capture comments and non-comments.


This one does not preserve formatting and uses no modifiers.
In a find loop, store Group 1 (comments) in a new file and
replace each match with Group 2 (non-comments) in the original file.
Adjust the regex line break as necessary, i.e. change \n to \r\n, etc.

   # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)


   (                                # (1 start), Comments 
        /\*                              # Start /* .. */ comment
        [^*]* \*+
        (?: [^/*] [^*]* \*+ )*
        /                                # End /* .. */ comment
     |  
        //                               # Start // comment
        (?: [^\\] | \\ \n? )*?           # Possible line-continuation
        \n                               # End // comment
   )                                # (1 end)
|  
   (                                # (2 start), Non - comments 
        "
        (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
        "
     |  '
        (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
        ' 
     |  [\S\s]                           # Any other char
        [^/"'\\]*                        # Chars which don't start a comment, string, escape,
                                         # or line continuation (escape + newline)
   )                                # (2 end)
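
Since the question uses Python, here is a minimal sketch of the find-and-replace loop described above, using `re.sub` with a callback. The names `COMMENT_RE` and `split_comments` are made up for illustration, and the comments are collected in a list rather than written to a file. Note that this regex only recognizes a // comment that ends with \n, so a // comment on the last line of a file with no trailing newline is left in place.

```python
import re

# Compact form of the first regex: group 1 = comments, group 2 = non-comments.
COMMENT_RE = re.compile(
    r"""(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)"""
    r"""|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)"""
)


def split_comments(source):
    """Return (source without comments, list of removed comments)."""
    comments = []

    def keep_code(match):
        if match.group(1) is not None:
            # A comment matched: remember it and drop it from the output.
            comments.append(match.group(1))
            return ""
        # A string literal or other code matched: keep it unchanged.
        return match.group(2)

    return COMMENT_RE.sub(keep_code, source), comments


if __name__ == "__main__":
    src = 'int a = 1; // set a\nchar *s = "// not a comment"; /* block */\n'
    code, removed = split_comments(src)
    print(code)
    print(removed)
```

Because the // branch consumes the trailing newline, the line after a // comment is joined onto the line before it, which is the "does not preserve formatting" behaviour mentioned above.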

Last rework:
Does a much better job of preserving formatting.
The formatting problem with newlines is handled at the comment tail.
While this fixes the problem of string concatenation, it does occasionally leave a blank
line where the comment was. For 98% of comments, this won't be an issue.
But it's time to leave this dead dog alone.

This one preserves formatting. It uses the multi-line regex modifier (be sure to set it).
Use it the same way as above.
This assumes your engine supports \h (horizontal whitespace). If not, let me know.
Adjust the regex line break as necessary, i.e. change \n to \r\n, etc.

   #  ((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?|//(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\\s]*)

   (                                # (1 start), Comments 
        (?:
             (?: ^ \h* )?                     # <- To preserve formatting
             (?:
                  /\*                              # Start /* .. */ comment
                  [^*]* \*+
                  (?: [^/*] [^*]* \*+ )*
                  /                                # End /* .. */ comment
                  (?:
                       \h* \n                                      
                       (?=                              # <- To preserve formatting 
                            \h*                              # <- To preserve formatting
                            (?: \n | /\* | // )              # <- To preserve formatting
                       )
                  )?                               # <- To preserve formatting
               |  
                  //                               # Start // comment
                  (?: [^\\] | \\ \n? )*?           # Possible line-continuation
                  (?:                              # End // comment
                       \n                               
                       (?=                              # <- To preserve formatting
                            \h*                              # <- To preserve formatting
                            (?: \n | /\* | // )              # <- To preserve formatting
                       )
                    |  (?= \n )
                  )
             )
        )+                               # Grab multiple comment blocks if need be
   )                                # (1 end)

|                                 ## OR

   (                                # (2 start), Non - comments 
        "
        (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
        "
     |  '
        (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
        ' 
     |  [\S\s]                           # Any other char
        [^/"'\\\s]*                      # Chars which don't start a comment, string, escape,
                                         # or line continuation (escape + newline)
   )                                # (2 end)
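
A rough sketch of how the formatting-preserving regex might be used from Python. Python's built-in `re` module does not support \h, so this sketch swaps in `[ \t]` as a stand-in (an assumption: it covers only space and tab, not every horizontal whitespace character). The names `FORMAT_RE` and `strip_comments` are hypothetical.

```python
import re

# The one-line regex from above, copied as-is (still containing \h).
_PATTERN = (
    r"""((?:(?:^\h*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:\h*\n(?=\h*(?:\n|/\*|//)))?"""
    r"""|//(?:[^\\]|\\\n?)*?(?:\n(?=\h*(?:\n|/\*|//))|(?=\n))))+)"""
    r"""|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\\s]*)"""
)

# Assumption: approximate \h with space-or-tab, since `re` rejects \h.
# re.MULTILINE is the "Multi-Line" modifier, so ^ matches at each line start.
FORMAT_RE = re.compile(_PATTERN.replace(r"\h", r"[ \t]"), re.MULTILINE)


def strip_comments(source):
    """Drop group 1 (comments) and keep group 2 (non-comments)."""
    return FORMAT_RE.sub(lambda m: m.group(2) or "", source)


if __name__ == "__main__":
    src = 'int a = 1; // set a\nint b = 2;\n'
    print(repr(strip_comments(src)))
```

Here the newline after a trailing // comment is kept, so the two statements stay on separate lines. A comment that sits on a line by itself can still leave a blank line behind, as noted above.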

Upvotes: 1
