user1663023

Golang regex replace excluding quoted strings

I'm trying to port a removeComments function from this JavaScript implementation to Go. The goal is to remove every comment from the text. For example:

/* this is comments, and should be removed */

However, "/* this is quoted, so it should not be removed*/"

In the JavaScript implementation, quoted strings are matched but not captured in a group, so I can easily filter them out. In Go, however, there seems to be no easy way to tell whether a match was captured in a group or not. How can I implement the same removeComments logic in Go?

Upvotes: 6

Views: 5523

Answers (6)

user557597

These do not preserve formatting


Preferred way (group 1 does not participate when a comment is matched, so $1 expands to an empty string).
Works in the Go playground:

     # https://play.golang.org/p/yKtPk5QCQV
     # fmt.Println(reg.ReplaceAllString(txt, "$1"))
     # (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)

     (?:                              # Comments 
          /\*                              # Start /* .. */ comment
          [^*]* \*+
          (?: [^/*] [^*]* \*+ )*
          /                                # End /* .. */ comment
       |  
          //  [^\n]*                       # Start // comment
          (?: \n | $ )                     # End // comment
     )
  |  
     (                                # (1 start), Non - comments 
          "
          [^"\\]*                          # Double quoted text
          (?: \\ [\S\s] [^"\\]* )*
          "
       |  
          '
          [^'\\]*                          # Single quoted text
          (?: \\ [\S\s] [^'\\]* )*
          ' 
       |  [\S\s]                           # Any other char
          [^/"'\\]*                        # Chars that don't start a comment, string, escape, or line continuation (escape + newline)
     )                                # (1 end)
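
As a minimal sketch of how the pattern above could be wired into Go: the helper name stripComments and the sample input are assumptions for illustration, while the regex is the compact one-line form quoted at the top of this block. Comments match without touching group 1, so replacing every match with "$1" keeps only the non-comment text.

```go
package main

import (
	"fmt"
	"regexp"
)

// commentOrCode is the compact pattern from the answer: an uncaptured
// comment alternative, or group 1 holding a string literal or other text.
var commentOrCode = regexp.MustCompile(`(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)`)

// stripComments (hypothetical helper name) replaces each match with group 1.
// For comment matches group 1 is unset, which expands to an empty string.
func stripComments(src string) string {
	return commentOrCode.ReplaceAllString(src, "$1")
}

func main() {
	in := `a /* c */ b "x /* y */ z" // tail`
	fmt.Printf("%q\n", stripComments(in))
}
```

Note that the whitespace around a removed comment is left in place, and a // comment takes its trailing newline with it, which is what "do not preserve formatting" refers to.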

Alternative way (group 1 always participates, but may be empty).
Works in the Go playground:

 # https://play.golang.org/p/7FDGZSmMtP
 # fmt.Println(reg.ReplaceAllString(txt, "$1"))
 # (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//[^\n]*(?:\n|$))?((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|[\S\s][^/"'\\]*)?)     

 (?:                              # Comments 
      /\*                              # Start /* .. */ comment
      [^*]* \*+
      (?: [^/*] [^*]* \*+ )*
      /                                # End /* .. */ comment
   |  
      //  [^\n]*                       # Start // comment
      (?: \n | $ )                     # End // comment
 )?
 (                                # (1 start), Non - comments 
      (?:
           "
           [^"\\]*                          # Double quoted text
           (?: \\ [\S\s] [^"\\]* )*
           "
        |  
           '
           [^'\\]*                          # Single quoted text
           (?: \\ [\S\s] [^'\\]* )*
           ' 
        |  [\S\s]                           # Any other char
           [^/"'\\]*                        # Chars that don't start a comment, string, escape, or line continuation (escape + newline)
      )?
 )                                # (1 end)

The Cadillac - Preserves Formatting

(Unfortunately, this can't be done in Go, because Go's regexp package does not support lookahead assertions.)
Posted in case you move to a different regex engine.

     # raw:   ((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
     # delimited:  /((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|("[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n|[\S\s])[^\/"'\\\s]*)/

     (                                # (1 start), Comments 
          (?:
               (?: ^ [ \t]* )?                  # <- To preserve formatting
               (?:
                    /\*                              # Start /* .. */ comment
                    [^*]* \*+
                    (?: [^/*] [^*]* \*+ )*
                    /                                # End /* .. */ comment
                    (?:                              # <- To preserve formatting 
                         [ \t]* \r? \n                                      
                         (?=
                              [ \t]*                  
                              (?: \r? \n | /\* | // )
                         )
                    )?
                 |  
                    //                               # Start // comment
                    (?:                              # Possible line-continuation
                         [^\\] 
                      |  \\ 
                         (?: \r? \n )?
                    )*?
                    (?:                              # End // comment
                         \r? \n                               
                         (?=                              # <- To preserve formatting
                              [ \t]*                          
                              (?: \r? \n | /\* | // )
                         )
                      |  (?= \r? \n )
                    )
               )
          )+                               # Grab multiple comment blocks if need be
     )                                # (1 end)

  |                                 ## OR

     (                                # (2 start), Non - comments 
          "
          [^"\\]*                          # Double quoted text
          (?: \\ [\S\s] [^"\\]* )*
          "
       |  
          '
          [^'\\]*                          # Single quoted text
          (?: \\ [\S\s] [^'\\]* )*
          ' 
       |  
          (?: \r? \n | [\S\s] )            # Linebreak or Any other char
          [^/"'\\\s]*                      # Chars that don't start a comment, string, escape,
                                           # or line continuation (escape + newline)
     )                                # (2 end)

Upvotes: 2

Uvelichitel

Reputation: 8490

Just for fun, here is another approach: a minimal lexer implemented as a state machine, inspired by and well described in Rob Pike's talk http://cuddle.googlecode.com/hg/talk/lex.html. The code is more verbose, but more readable, understandable and hackable than a regexp. It also works with any reader and writer rather than only strings, so it does not need to hold the whole input in memory and should even be faster.

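import "io"

// stateFn is one lexer state; it returns the next state, or nil to stop.
type stateFn func(*lexer) stateFn

func run(l *lexer) {
    for state := lexText; state != nil; {
        state = state(l)
    }
}

// lexer copies input to output, dropping /* ... */ comments.
// io.RuneScanner is used (rather than io.RuneReader) so that a
// one-rune lookahead can be pushed back with UnreadRune.
type lexer struct {
    io.RuneScanner
    io.Writer
}

// emit writes a single rune to the output.
func (l *lexer) emit(r rune) {
    l.Write([]byte(string(r)))
}

// lexText handles everything outside strings and comments.
func lexText(l *lexer) stateFn {
    // Stop on any read error, not only io.EOF, to avoid looping forever.
    for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
        switch r {
        case '"':
            l.emit(r)
            return lexQuoted
        case '/':
            next, _, err := l.ReadRune()
            if err == nil && next == '*' {
                return lexComment
            }
            l.emit('/')
            if err != nil {
                return nil
            }
            l.UnreadRune() // let the loop re-examine the rune after '/'
        default:
            l.emit(r)
        }
    }
    return nil
}

// lexQuoted copies a double-quoted string verbatim.
func lexQuoted(l *lexer) stateFn {
    for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
        l.emit(r)
        switch r {
        case '\\': // copy the escaped rune so that \" does not end the string
            if esc, _, err := l.ReadRune(); err == nil {
                l.emit(esc)
            }
        case '"':
            return lexText
        }
    }
    return nil
}

// lexComment discards everything up to and including the closing */.
func lexComment(l *lexer) stateFn {
    for r, _, err := l.ReadRune(); err == nil; r, _, err = l.ReadRune() {
        if r != '*' {
            continue
        }
        next, _, err := l.ReadRune()
        if err != nil {
            return nil
        }
        if next == '/' {
            return lexText
        }
        l.UnreadRune() // next may itself be '*', as in "**/"
    }
    return nil
}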

You can see it working here: http://play.golang.org/p/HyvEeANs1u

Upvotes: 2

Steve Chambers

Reputation: 39424

Demo

Play golang demo

(The workings at each stage are output and the end result can be seen by scrolling down.)

Method

A few "tricks" are used to work around Golang's somewhat limited regex syntax:

  1. Replace start quotes and end quotes with a unique character. Crucially, the characters used to identify start and end quotes must be different from each other and extremely unlikely to appear in the text being processed.
  2. Replace all comment starters (/*) that are not preceded by an unterminated start quote with a unique sequence of one or more characters.
  3. Similarly, replace all comment enders (*/) that are not followed by an end quote lacking a start quote before it with a different unique sequence of one or more characters.
  4. Remove all remaining /*...*/ comment sequences.
  5. Unmask the previously "masked" comment starters/enders by reversing the replacements made in steps 2 and 3 above.

Limitations

The current demo doesn't address the possibility of a double quote appearing within a comment, e.g. /* Not expected: " */. Note: My feeling is this could be handled - just haven't put the effort in yet - so let me know if you think it could be an issue and I'll look into it.

Upvotes: 1

Wiktor Stribiżew

Reputation: 626893

BACKGROUND

The correct way to do the task is to match and capture quoted strings (bearing in mind there can be escaped entities inside) and then match the multiline comments.

REGEX IN-CODE DEMO

Here is the code to deal with that:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
    txt := `random text
            /* removable comment */
            "but /* never remove this */ one"
             more random *text*`
    fmt.Println(reg.ReplaceAllString(txt, "$1"))
}

See the Playground demo

EXPLANATION

The regex I suggest is written with the Best Regex Trick Ever concept in mind and consists of 2 alternatives:

  • ("[^"\\]*(?:\\.[^"\\]*)*") - Double quoted string literal regex - Group 1 (see the capturing group formed with the outer pair of unescaped parentheses and later accessible via replacement backreferences) matching double quoted string literals that can contain escaped sequences. This part matches:
    • " - a leading double quote
    • [^"\\]* - 0+ characters other than " and \ (as [^...] construct is a negated character class that matches any characters but those defined inside it) (the * is a zero or more occurrences matching quantifier)
    • (?:\\.[^"\\]*)* - 0+ sequences (see the last * and the non-capturing group used only to group subpatterns without forming a capture) of an escaped sequence (the \\. matches a literal \ followed with any character) followed with 0+ characters other than " and \
    • " - a trailing double quote
  • | - or
  • /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ - multiline comment regex part that matches without forming a capture group (thus, unavailable from the replacement pattern via backreferences) and matches
    • / - the / literal slash
    • \* - the literal asterisk
    • [^*]* - zero or more characters other than an asterisk
    • \*+ - 1 or more (the + is a one or more occurrences matching quantifier) asterisks
    • (?:[^/*][^*]*\*+)* - 0+ sequences (non-capturing, we do not use it later) of any character but a / or * (see [^/*]), followed with 0+ characters other than an asterisk (see [^*]*) and then followed with 1+ asterisks (see \*+).
    • / - a literal (trailing, closing) slash.

NOTE: This multiline comment regex is the fastest I have ever tested. Same goes for the double quoted literal regex as "[^"\\]*(?:\\.[^"\\]*)*" is written with the unroll-the-loop technique in mind: no alternations, only character classes with * and + quantifiers are used in a specific order to allow the fastest matching.
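
The Group 1 distinction described above can also be checked explicitly in Go. The sketch below is an illustration, not part of the original demo: the function name removeComments and the sample input are assumptions. It relies on FindAllStringSubmatchIndex, which reports a negative index for a group that did not take part in a match.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same pattern as in the demo: Group 1 is a string literal, the
// uncaptured alternative is a /* ... */ comment.
var stringOrComment = regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*")|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)

// removeComments (hypothetical name) rebuilds the text match by match.
// loc[0]:loc[1] is the whole match, loc[2]:loc[3] is Group 1, and
// loc[2] is -1 when Group 1 did not participate (a comment matched).
func removeComments(src string) string {
	var out strings.Builder
	last := 0
	for _, loc := range stringOrComment.FindAllStringSubmatchIndex(src, -1) {
		out.WriteString(src[last:loc[0]]) // text between matches
		if loc[2] >= 0 {
			out.WriteString(src[loc[2]:loc[3]]) // keep the string literal
		}
		last = loc[1]
	}
	out.WriteString(src[last:])
	return out.String()
}

func main() {
	fmt.Println(removeComments(`a /* gone */ "keep /* this */" b`))
}
```

For plain removal this gives the same result as ReplaceAllString with "$1"; the explicit loop is mainly useful when comments and strings need different handling.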

NOTES ON PATTERN ENHANCEMENTS

If you plan to extend to matching single quoted literals, there is nothing easier, just add another alternative into the 1st capture group by re-using the double quoted string literal regex and replacing the double quotes with single ones:

reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/`)
                                                    ^-------------------------^

Here is the single- and double-quoted literal supporting regex demo removing the multiline comments

Adding single line comment support is similar: just add an alternative such as //.*[\r\n]* to the end (//[^\n\r]* also works if the trailing line break should be kept):

reg := regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*[\r\n]*`)
                                                                                                              ^-----------^

Here is the single- and double-quoted literal supporting regex demo removing the multiline and single line comments
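
For reference, a small runnable sketch of the extended pattern. The function name strip and the sample input are assumptions made for illustration. It shows that a // sequence inside a quoted literal survives, while a real // comment is removed together with its line break.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern from the last snippet above: quoted literals are captured in
// Group 1, block comments and line comments are matched without a capture.
var reg = regexp.MustCompile(`("[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*')|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*[\r\n]*`)

// strip (hypothetical name) keeps Group 1 and drops everything else matched.
func strip(src string) string {
	return reg.ReplaceAllString(src, "$1")
}

func main() {
	in := "x := 1 // one\ny := \"// kept\" /* gone */\n"
	fmt.Printf("%q\n", strip(in))
}
```

Because [\r\n]* consumes the line break after a // comment, the following line is joined to the current one. Swapping in //[^\n\r]* leaves the line break in place.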

Upvotes: 3

Laurel

Reputation: 6173

I've never read or written anything in Go, so bear with me. Fortunately, I know regex. I did a little research on Go regexes, and it seems that they lack some modern features (such as backreferences and lookaround).

Despite that, I've developed a regex that seems to be what you're looking for. I'm assuming that all strings are single line. Here it is:

reg := regexp.MustCompile(`(?m)^([^"\n]*)/\*([^*]+|(\*+[^/]))*\*+/`)

txt := `random text
        /* removable comment */
            "but /* never remove this */ one"
        more random *text*`

fmt.Println(reg.ReplaceAllString(txt, "${1}"))

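Variation: The version above will not remove comments that appear after a quoted string on the same line. This version will, but it may need to be run multiple times: the match is anchored at the start of the line, so each pass removes at most one comment per line.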

reg := regexp.MustCompile(
   `(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`
)
txt := `
   random text
   what /* removable comment */
   hi "but /* never remove this */ one" then /*whats here*/ i don't know /*what*/
   more random *text*
`
newtxt := reg.ReplaceAllString(txt, "${1}")
fmt.Println(newtxt)
newtxt = reg.ReplaceAllString(newtxt, "${1}")
fmt.Println(newtxt)
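
Rather than guessing how many passes are needed, the replacement can be repeated until the text stops changing. The following is a minimal sketch under that assumption; the function name removeAll and the sample input are made up for illustration. Each pass that changes anything shortens the text, so the loop terminates.

```go
package main

import (
	"fmt"
	"regexp"
)

// The "variation" pattern from above: everything before the comment on
// the same line (including complete quote pairs) is captured in group 1.
var blockComment = regexp.MustCompile(`(?m)^(([^"\n]*|("[^"\n]*"))*)/\*([^*]+|(\*+[^/]))*\*+/`)

// removeAll (hypothetical name) reapplies the replacement until a pass
// leaves the text unchanged.
func removeAll(src string) string {
	for {
		next := blockComment.ReplaceAllString(src, "${1}")
		if next == src {
			return src
		}
		src = next
	}
}

func main() {
	fmt.Println(removeAll(`a "x /* keep */ y" b /*1*/ c /*2*/ d`))
}
```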

Explanation

  • (?m) means multiline mode. Regex101 gives a nice explanation of this:

    The ^ and $ anchors now match at the beginning/end of each line respectively, instead of beginning/end of the entire string.

    It needs to be anchored to the beginning of each line (with ^) to ensure a quote hasn't started.

  • The first regex has this: [^"\n]*. Essentially, it matches everything that's not " or \n. I've added parentheses because this text isn't a comment, so it needs to be put back.

  • The second regex has this: (([^"\n]*|("[^"\n]*"))*). With this, the regex can match either [^"\n]* (as the first regex does) or (|) a pair of quotes and the content between them, with "[^"\n]*". It repeats so that it works when there is more than one quote pair. Note that, as in the simpler regex, this non-comment text is captured.

  • Both regexes use this: /\*([^*]+|(\*+[^/]))*\*+/. It matches /* followed by any number of either:

    • [^*]+ Non * chars

    or

    • \*+[^/] One or more *s that are not followed by /.
  • And then it matches the closing */

  • During replacement, the ${1} refers to the non-comment things that were captured, so they're reinserted into the string.

Upvotes: 2

tisov

Reputation: 69

Try this example..

play golang

Upvotes: 0
