How to not capture whitespaces after a new line with regex in c++

Question

I am trying to catch comments from c/c++/java files but I cannot find a way to skip whitespaces that may exist after a new line. My regex pattern is

regex reg("(//.*|/\\*(.|\n)*?\\*/)");

For example, in the following code (don't bother about the random code snippets, they could be anything...) I correctly catch the comments:

// my  program in C++
#include 
/** playing around in
a new programming language **/
using namespace std;

and the output is:

// my  program in C++
/** playing around in
a new programming language **/

However, when I have code with whitespace inside a multiline comment, like:

int main(){
        /* start always points to the first node of the linked list.
           temp is used to point to the last node of the linked list.*/
        node *start,*temp;
        start = (node *)malloc(sizeof(node));
        temp = start;
        temp -> next = NULL;
        temp -> prev = NULL;
        /* Here in this code, we take the first node as a dummy node.
           The first node does not contain data, but it used because to avoid handling special cases
           in insert and delete functions.
         */
        printf("1. Insert\n");

I capture:

/* start always points to the first node of the linked list.
           temp is used to point to the last node of the linked list.*/
/* Here in this code, we take the first node as a dummy node.
           The first node does not contain data, but it used because to avoid handling special cases
           in insert and delete functions.
         */

instead of:

/* start always points to the first node of the linked list.
temp is used to point to the last node of the linked list.*/
/* Here in this code, we take the first node as a dummy node.
The first node does not contain data, but it used because to avoid handling special cases
in insert and delete functions.
*/

How can I change the regex pattern to avoid this?

NOTE: If possible, I would like to avoid string manipulators etc. and solve this with a regex modification only.

Wiktor Stribiżew · Accepted Answer

Converting my comment above.

A single regex match cannot span discontinuous text. Instead, you can match a part of the text with a regex and then post-process the matched (or captured) value with another regex or with string manipulations.

Here is an example (not the best, just to show the concept):

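#include <iostream>
#include <regex>
#include <string>

int main() {
    // Sample input; adjacent string literals are concatenated by the compiler.
    std::string data(
        "int main(){// Singleline content\n"
        "        /* start always points to the first node of the linked list.\n"
        "           temp is used to point to the last node of the linked list.*/\n"
        "        node *start,*temp;\n"
        "        start = (node *)malloc(sizeof(node));\n"
        "        temp = start;\n"
        "        temp -> next = NULL;\n"
        "        temp -> prev = NULL;\n"
        "        /* Here in this code, we take the first node as a dummy node.\n"
        "           The first node does not contain data, but it used because to avoid handling special cases\n"
        "           in insert and delete functions.\n"
        "         */\n"
        "        printf(\"1. Insert\\n\");");
    //std::cout << "Data: " << data << std::endl;
    // Matches a // comment or a /* ... */ comment.
    std::regex pattern(R"(//.*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)");
    std::smatch result;

    while (std::regex_search(data, result, pattern)) {
        // Post-process each match: drop horizontal whitespace at the start of every line.
        std::cout << std::regex_replace(result[0].str(), std::regex(R"((^|\n)[^\S\r\n]+)"), "$1") << std::endl;
        // Continue searching in the text after the current match.
        data = result.suffix().str();
    }
}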

See the IDEONE demo

NOTE: Raw string literals simplify regex definition.

The R"(//.*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)" pattern has two alternatives. //.* matches // followed by any 0+ characters other than a newline (single-line comments). /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ matches /*, then 0+ non-* characters, then 1+ *s, then 0+ repetitions of a character other than / and * followed by 0+ non-* characters and 1+ *s, and finally the closing / (multiline comments). This multiline comment pattern is much more efficient than the one you have, since it is written according to the unroll-the-loop technique.
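To see the comment pattern on its own, here is a minimal sketch that collects all matches with std::sregex_iterator instead of the regex_search loop above. The helper name extract_comments is hypothetical (it is not part of the answer's code); only the pattern is taken from the answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (name chosen for illustration): returns every
// single-line and multiline comment found in `src`, in order.
std::vector<std::string> extract_comments(const std::string& src) {
    // Same pattern as in the answer: a // comment, or an unrolled /* ... */ matcher.
    static const std::regex pattern(R"(//.*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)");
    std::vector<std::string> comments;
    // A default-constructed sregex_iterator acts as the end-of-sequence marker.
    for (std::sregex_iterator it(src.begin(), src.end(), pattern), end; it != end; ++it) {
        comments.push_back(it->str());
    }
    return comments;
}
```

Calling extract_comments on a source string returns the raw comment texts, indentation included. Note that the pattern knows nothing about string literals, so a // or /* inside a quoted string would also be reported as a comment.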

I removed the leading horizontal whitespace on each line with regex_replace(result[0].str(), std::regex(R"((^|\n)[^\S\r\n]+)"), "$1"): (^|\n)[^\S\r\n]+ matches and captures a start-of-string anchor or a newline, followed by 1+ characters other than non-whitespace, CR, and LF (that is, horizontal whitespace). The $1 replacement puts the captured newline back, so only the indentation is removed.
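The post-processing step can also be isolated into a small function. This is only a sketch: the name strip_line_indent is hypothetical, and the regex is the (^|\n)[^\S\r\n]+ pattern described above.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (name chosen for illustration): removes horizontal
// whitespace at the start of the string and after every newline.
std::string strip_line_indent(const std::string& text) {
    // (^|\n) captures the start of the string or a newline;
    // [^\S\r\n]+ is 1+ whitespace characters that are neither CR nor LF.
    static const std::regex indent(R"((^|\n)[^\S\r\n]+)");
    // "$1" re-inserts the captured newline, so line breaks are preserved.
    return std::regex_replace(text, indent, "$1");
}
```

Applied to a matched comment such as "/* a\n   b\n */", this should yield "/* a\nb\n*/": the line breaks stay, and only the indentation after them is dropped.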
