BasicYard

Reputation: 13

Regex skip in C++

This is my string:

/*
  Block1 {

    anythinghere
  }
*/

// Block2 { }
# Block3 { }

Block4 {

    anything here
}

I am using this regex to capture each block's name and its content:

regex e(R"~((\w+)\s+\{([^}]+)\})~", std::regex::optimize);

But this regex also matches the blocks inside the comments (descriptions). PHP's regex engine has a "skip" idiom that lets you skip over everything inside them:

What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match

But this is C++, and I cannot use this skip method with std::regex.

My regex currently detects Block1, Block2, Block3 and Block4. I want to skip Block1, Block2 and Block3, which sit inside comments, and get only Block4. How should I edit my regex so that it matches only blocks outside the comments?

Upvotes: 0

Views: 588

Answers (2)

user557597

Reputation:

Since you requested this long regex, here it is.

This will not handle nested blocks like block{ block{ } }.
Because the block body stops at the first closing brace, it would match
block{ block{ } only.

Since you specified that you are using C++11 as the engine, I didn't use
recursion. This is easily changed to use recursion if you were to use, say,
PCRE, Perl, or even Boost.Regex. Let me know if you'd want to see that.

As it stands it is flawed, but it works for your sample.
Another thing it won't do is parse preprocessor directives ('#...'), because
I forgot the rules for those (I thought I did it recently, but can't find a record).

To use it, sit in a while ( regex_search() ) loop looking for a match on
capture group 1, if (m[1].matched) etc. That will be your block.
The rest of the matches are for comments, quotes, or non-comments, unrelated
to the block. These have to be matched to advance the match position.

The regex is long and redundant because there are no function calls (recursion) in the C++11 ECMAScript grammar. Like I said, use boost::regex or something.
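
The loop described above can be sketched roughly as follows. This is not the full regex from this answer but a deliberately simplified pattern that shows the same idea: alternatives that consume comments come first, and only the last alternative has capture groups. The helper name find_blocks is made up for illustration, '#' lines are treated as comments only because the question's sample uses them that way, and quoted strings are not handled here.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (name, body) for every block found outside comments.
std::vector<std::pair<std::string, std::string>> find_blocks(const std::string& text) {
    // Alternatives 1-3 consume /* */, // and # comments (the parts to skip).
    // Alternative 4 captures the block name in group 1 and its body in group 2.
    static const std::regex re(
        R"~(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*|#.*|(\w+)\s*\{([^}]*)\})~",
        std::regex::optimize);

    std::vector<std::pair<std::string, std::string>> blocks;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        // Comment matches leave group 1 unmatched; they only advance the position.
        if (m[1].matched) {
            blocks.emplace_back(m[1].str(), m[2].str());
        }
    }
    return blocks;
}
```

Called on the sample from the question, find_blocks should return a single entry named Block4. Like the long regex, this sketch does not handle nested blocks.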

Benchmark

Sample:

/*
  Block1 {

    anythinghere
  }
*/

// Block2 { }

Block4 {

   // CommentedBlock{ asdfasdf }
    anyth"}"ing here
}

Block5 {

   /* CommentedBlock{ asdfasdf }
    anyth}"ing here
   */
}

Results:

Regex1:   (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   8
Elapsed Time:    1.95 s,   1947.26 ms,   1947261 µs

Regex Explained:

    # Raw:        (?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})|[\S\s](?:(?!\w+\s*\{(?:(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|(?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?!\})[\S\s][^}/"'\\]*))*\})[^/"'\\])*)
    # Stringed:  "(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})|[\\S\\s](?:(?!\\w+\\s*\\{(?:(?:/\\*[^*]*\\*+(?:[^/*][^*]*\\*+)*/|//(?:[^\\\\]|\\\\\\n?)*?\\n)|(?:\"[^\"\\\\]*(?:\\\\[\\S\\s][^\"\\\\]*)*\"|'[^'\\\\]*(?:\\\\[\\S\\s][^'\\\\]*)*'|(?!\\})[\\S\\s][^}/\"'\\\\]*))*\\})[^/\"'\\\\])*)"     


    (?:                              # Comments 
         /\*                              # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                                # End /* .. */ comment
      |  
         //                               # Start // comment
         (?: [^\\] | \\ \n? )*?           # Possible line-continuation
         \n                               # End // comment
    )
 |                                 # OR,

    (?:                              # Non - comments 
         "
         [^"\\]*                          # Double quoted text
         (?: \\ [\S\s] [^"\\]* )*
         "
      |  '
         [^'\\]*                          # Single quoted text
         (?: \\ [\S\s] [^'\\]* )*
         ' 
      |  
         (                                # (1 start), BLOCK
              \w+ \s* \{               
              ####################
              (?:                              # ------------------------
                   (?:                              # Comments  inside a block
                        /\*                             
                        [^*]* \*+
                        (?: [^/*] [^*]* \*+ )*
                        /                                
                     |  
                        //                               
                        (?: [^\\] | \\ \n? )*?
                        \n                               
                   )
                |  
                   (?:                              # Non - comments inside a block
                        "
                        [^"\\]*                          
                        (?: \\ [\S\s] [^"\\]* )*
                        "
                     |  '
                        [^'\\]*                          
                        (?: \\ [\S\s] [^'\\]* )*
                        ' 
                     |  
                        (?! \} )
                        [\S\s]                          
                        [^}/"'\\]*                      
                   )
              )*                               # ------------------------
              #####################          
              \}                               
         )                                # (1 end), BLOCK

      |                                 # OR,

         [\S\s]                           # Any other char
         (?:                              # -------------------------
              (?!                              # ASSERT: Here, cannot be a BLOCK{ }
                   \w+ \s* \{                      
                   (?:                              # ==============================
                        (?:                              # Comments inside a block
                             /\*                              
                             [^*]* \*+
                             (?: [^/*] [^*]* \*+ )*
                             /                                
                          |  
                             //                               
                             (?: [^\\] | \\ \n? )*?
                             \n                               
                        )
                     |  
                        (?:                              # Non - comments inside a block
                             "
                             [^"\\]*                          
                             (?: \\ [\S\s] [^"\\]* )*
                             "
                          |  
                             '
                             [^'\\]*                          
                             (?: \\ [\S\s] [^'\\]* )*
                             ' 
                          |  
                             (?! \} )
                             [\S\s]                          
                             [^}/"'\\]*                       
                        )
                   )*                               # ==============================
                   \}                               
              )                                # ASSERT End

              [^/"'\\]                         # Char which doesn't start a comment, string, escape,
                                               # or line continuation (escape + newline)
         )*                               # -------------------------
    )                                # Done Non - comments 

Upvotes: 1

Jay Elston

Reputation: 2078

TL;DR: Regular expressions cannot be used to parse full-blown computer languages. What you want to do cannot be done with regular expressions. You need to develop a mini-C++ parser to filter out comments. The answer to this related question might point you in the right direction.

Regex can be used to process regular languages, but computer languages such as C++, PHP, Java, C#, HTML, etc. have a more complex syntax that includes a property named "middle recursion". Middle recursion includes complications such as an arbitrary number of matching parentheses, begin/end quotes, and comments that can contain symbols.

If you want to understand this in more detail, read the answers to this question about the difference between regular expressions and context-free grammars. If you are really curious, enroll in a Formal Language Theory class.
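
As a rough illustration of what such a mini parser could look like, here is a small hand-written scanner. It is a sketch under assumptions, not a real C++ parser: the names skip_trivia and scan_blocks are invented for this example, '#' lines are treated as comments because the question's sample does so, and a block is taken to be an identifier followed by a brace-delimited body. Counting brace depth is what lets it handle the nesting that a plain regex cannot.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: if a comment or quoted string starts at s[i], moves i
// past it and returns true; otherwise leaves i alone and returns false.
static bool skip_trivia(const std::string& s, std::size_t& i) {
    const std::size_t n = s.size();
    if (s.compare(i, 2, "/*") == 0) {
        const std::size_t close = s.find("*/", i + 2);
        i = (close == std::string::npos) ? n : close + 2;
        return true;
    }
    if (s.compare(i, 2, "//") == 0 || s[i] == '#') {
        const std::size_t eol = s.find('\n', i);
        i = (eol == std::string::npos) ? n : eol + 1;
        return true;
    }
    if (s[i] == '"' || s[i] == '\'') {
        const char quote = s[i++];
        while (i < n && s[i] != quote) {
            i += (s[i] == '\\' && i + 1 < n) ? 2 : 1;  // step over escaped chars
        }
        if (i < n) ++i;  // step over the closing quote
        return true;
    }
    return false;
}

// Hypothetical helper: returns (name, body) for each top-level block.
std::vector<std::pair<std::string, std::string>> scan_blocks(const std::string& s) {
    std::vector<std::pair<std::string, std::string>> blocks;
    const std::size_t n = s.size();
    std::size_t i = 0;
    std::string name;  // most recent identifier seen outside any block
    while (i < n) {
        if (skip_trivia(s, i)) continue;
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::isalnum(c) || c == '_') {
            const std::size_t start = i;
            while (i < n && (std::isalnum(static_cast<unsigned char>(s[i])) || s[i] == '_')) ++i;
            name = s.substr(start, i - start);
        } else if (c == '{' && !name.empty()) {
            const std::size_t body = ++i;
            int depth = 1;  // counts unmatched opening braces
            while (i < n && depth > 0) {
                if (skip_trivia(s, i)) continue;
                if (s[i] == '{') ++depth;
                else if (s[i] == '}') --depth;
                ++i;
            }
            // When depth reaches 0, i is one past the matching closing brace.
            if (depth == 0) blocks.emplace_back(name, s.substr(body, i - 1 - body));
            name.clear();
        } else {
            if (!std::isspace(c)) name.clear();
            ++i;
        }
    }
    return blocks;
}
```

Braces inside comments and quoted strings are ignored, and a nested input such as Outer { Inner { x } } is reported as one block named Outer whose body contains the inner block.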

Upvotes: 1
