Reputation: 5190
I am trying to extract comments from C/C++/Java files, but I cannot find a way to skip the whitespace that may follow a newline inside a multiline comment. My regex pattern is
regex reg("(//.*|/\\*(.|\\n)*?\\*/)");
For example, in the following code (ignore the random snippets, they could be anything) the comments are captured correctly:
// my program in C++
#include <iostream>
/** playing around in
a new programming language **/
using namespace std;
and the output is:
// my program in C++
/** playing around in
a new programming language **/
However, when a multiline comment has indented continuation lines, as in:
int main(){
    /* start always points to the first node of the linked list.
       temp is used to point to the last node of the linked list.*/
    node *start,*temp;
    start = (node *)malloc(sizeof(node));
    temp = start;
    temp -> next = NULL;
    temp -> prev = NULL;
    /* Here in this code, we take the first node as a dummy node.
       The first node does not contain data, but it used because to avoid handling special cases
       in insert and delete functions.
    */
    printf("1. Insert\n");
I capture:
/* start always points to the first node of the linked list.
       temp is used to point to the last node of the linked list.*/
/* Here in this code, we take the first node as a dummy node.
       The first node does not contain data, but it used because to avoid handling special cases
       in insert and delete functions.
    */
instead of:
/* start always points to the first node of the linked list.
temp is used to point to the last node of the linked list.*/
/* Here in this code, we take the first node as a dummy node.
The first node does not contain data, but it used because to avoid handling special cases
in insert and delete functions.
*/
How can I change the regex pattern so that this leading whitespace is not captured?
NOTE: If possible, I would like to avoid string manipulation and solve this with a regex modification only.
Upvotes: 2
Views: 98
Reputation: 627101
Converting my comment above.
A regex match is always one contiguous span of text, so it cannot skip the whitespace in the middle of a comment. Instead, you can match a part of the text with a regex and then post-process the matched (or captured) value with another regex or with string manipulations.
Here is an example (not the best, just to show the concept):
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string data("int main(){// Singleline content\n /* start always points to the first node of the linked list.\n temp is used to point to the last node of the linked list.*/\n node *start,*temp;\n start = (node *)malloc(sizeof(node));\n temp = start;\n temp -> next = NULL;\n temp -> prev = NULL;\n /* Here in this code, we take the first node as a dummy node.\n The first node does not contain data, but it used because to avoid handling special cases\n in insert and delete functions.\n */\n printf(\"1. Insert\\n\");");
    // Matches either a // comment or a /* ... */ comment
    std::regex pattern(R"(//.*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)");
    std::smatch result;
    while (std::regex_search(data, result, pattern)) {
        // Strip leading horizontal whitespace from every line of the match
        std::cout << std::regex_replace(result[0].str(), std::regex(R"((^|\n)[^\S\r\n]+)"), "$1") << std::endl;
        data = result.suffix().str(); // continue after the current match
    }
}
See the IDEONE demo
NOTE: Raw string literals simplify regex definition.
The pattern R"(//.*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)" has two alternatives. The first, //.* , matches // followed by any 0+ characters other than a newline (single-line comments). The second, /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ , matches /* followed by 0+ non-* characters, then 1+ * characters, then 0+ sequences of a character other than / and * followed by 0+ non-* characters and 1+ * characters, and finally the closing / (multiline comments). This multiline branch is much more efficient than the (.|\n)*? one you have since it is written according to the unroll-the-loop technique.
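To see how the unrolled branch behaves on awkward inputs, here is a minimal sketch that wraps the same pattern in a helper. The name find_comments is hypothetical (it is not part of the answer's code), and the helper only collects the raw matches without any whitespace cleanup.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): returns every comment found in src.
std::vector<std::string> find_comments(const std::string& src) {
    // Same pattern as in the answer: a // comment, or an unrolled /* ... */ matcher.
    static const std::regex pattern(R"(//.*|/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)");
    std::vector<std::string> comments;
    // sregex_iterator walks over all non-overlapping matches from left to right.
    for (std::sregex_iterator it(src.begin(), src.end(), pattern), end; it != end; ++it) {
        comments.push_back(it->str());
    }
    return comments;
}
```

Called on a string containing // one, /***/, /* a * b / c */ and an unterminated /* open, it should return the first three and skip the unterminated one, since the pattern requires a closing */.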
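I removed the leading horizontal whitespace on each line with regex_replace(result[0].str(), std::regex(R"((^|\n)[^\S\r\n]+)"), "$1"). Here, (^|\n)[^\S\r\n]+ matches and captures the start of the string or a newline, followed by 1+ whitespace characters other than CR and LF (that is, horizontal whitespace such as spaces and tabs). The $1 replacement puts the captured newline back, so only the indentation is removed.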
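The post-processing step can also be isolated in a small function. This is only a sketch: strip_line_indentation is a hypothetical name, and it assumes the regex implementation accepts \S inside a bracket expression (as the ECMAScript grammar used by std::regex specifies).

```cpp
#include <regex>
#include <string>

// Hypothetical helper: removes spaces and tabs at the start of every line of text.
std::string strip_line_indentation(const std::string& text) {
    // (^|\n) captures the start of the string or a newline;
    // [^\S\r\n]+ is 1+ whitespace characters that are neither CR nor LF.
    static const std::regex indent(R"((^|\n)[^\S\r\n]+)");
    // "$1" re-inserts the captured newline, so line breaks are preserved.
    return std::regex_replace(text, indent, "$1");
}
```

Building the regex once as a static object avoids recompiling it for every comment, which matters if many matches are post-processed in a loop.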
Upvotes: 1