Reputation: 61
I am writing a text parser that needs to remove comments from lines. The language is simple: every comment starts with a # character, so it would be easy to remove everything after that, but I have to handle the case where the # is inside a string.
My question, therefore, is: given a string such as
Value="String#1";"String#2"; # This is an array of "-delimited strings, "Like this"
how can I best extract the substring
Value="String#1";"String#2";
(note the trailing space)
Note that the comment may contain quotes. Also, a line may use either " or ' as its string delimiter, but the choice is consistent over the entire line and is known beforehand, if that matters. Quotes within strings are escaped with a \
Upvotes: 4
Views: 593
Reputation: 83245
#include <string>

std::string stripComment(const std::string& str) {
    bool escaped = false;        // previous character was a backslash inside a string
    bool inSingleQuote = false;
    bool inDoubleQuote = false;
    for(std::string::const_iterator it = str.begin(); it != str.end(); ++it) {
        if(escaped) {
            // This character is escaped, so it has no special meaning.
            escaped = false;
        } else if(*it == '\\' && (inSingleQuote || inDoubleQuote)) {
            escaped = true;
        } else if(inSingleQuote) {
            if(*it == '\'') {
                inSingleQuote = false;
            }
        } else if(inDoubleQuote) {
            if(*it == '"') {
                inDoubleQuote = false;
            }
        } else if(*it == '\'') {
            inSingleQuote = true;
        } else if(*it == '"') {
            inDoubleQuote = true;
        } else if(*it == '#') {
            // A '#' outside any string starts the comment.
            return std::string(str.begin(), it);
        }
    }
    return str;  // no comment on this line
}
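The question notes that the delimiter is known beforehand for each line. Under that assumption, the loop above could be reduced to a single in-string flag. This is only a minimal sketch: the name stripCommentKnownQuote and its quote parameter are hypothetical and not part of the original answer.

```cpp
#include <string>

// Hypothetical variant: `quote` is the delimiter known for this line (' or ").
std::string stripCommentKnownQuote(const std::string& line, char quote) {
    bool inString = false;  // currently inside a quoted string
    bool escaped = false;   // previous character was a backslash inside a string
    for (std::string::size_type i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (escaped) {
            escaped = false;           // this character is taken literally
        } else if (inString && c == '\\') {
            escaped = true;            // the next character is escaped
        } else if (c == quote) {
            inString = !inString;      // opening or closing delimiter
        } else if (c == '#' && !inString) {
            return line.substr(0, i);  // the comment starts here
        }
    }
    return line;  // no comment found
}
```

Called with the line from the question and '"' as the delimiter, this returns everything up to and including the space before the #. One difference from the two-flag version: the other quote character is treated as ordinary text, so a stray apostrophe on a "-delimited line does not open a string.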
EDIT: Or a more textbook FSM,
#include <string>

std::string stripComment(const std::string& str) {
    static const int states[5][4] = {
        // columns: other, \, ', "
        {0, 0, 1, 2}, // outside any string
        {1, 3, 0, 1}, // single quoted string
        {2, 4, 2, 0}, // double quoted string
        {1, 1, 1, 1}, // escape in single quoted string
        {2, 2, 2, 2}, // escape in double quoted string
    };
    int state = 0;
    for(std::string::const_iterator it = str.begin(); it != str.end(); ++it) {
        switch(*it) {
        case '\\':
            state = states[state][1];
            break;
        case '\'':
            state = states[state][2];
            break;
        case '"':
            state = states[state][3];
            break;
        case '#':
            if(!state) {
                return std::string(str.begin(), it);
            }
            // Inside a string: fall through and treat '#' like any other character.
        default:
            state = states[state][0];
            break;
        }
    }
    return str;
}
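The states array defines the transitions between the states of the FSM. The first index is the current state: 0 (outside any string), 1 (in a single-quoted string), 2 (in a double-quoted string), 3 (after a backslash in a single-quoted string), or 4 (after a backslash in a double-quoted string). The second index corresponds to the character: 0 for any other character, 1 for \, 2 for ', and 3 for ". The array gives the next state, based on the current state and the character.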
FYI, both versions assume that a backslash escapes any character inside a string. At a minimum, backslashes need to escape backslashes, so that a string can end with a backslash.
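To make the table and the escaping note concrete, here is a small sketch that records the state after each character. The helper name traceStates is hypothetical, and the range-based for loop assumes C++11 or later; the table itself is the one from the answer.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the FSM state after each character of `str`,
// stopping (without recording) at the '#' that starts a comment.
std::vector<int> traceStates(const std::string& str) {
    static const int states[5][4] = {
        // columns: other, \, ', "
        {0, 0, 1, 2}, // outside any string
        {1, 3, 0, 1}, // single quoted string
        {2, 4, 2, 0}, // double quoted string
        {1, 1, 1, 1}, // escape in single quoted string
        {2, 2, 2, 2}, // escape in double quoted string
    };
    std::vector<int> trace;
    int state = 0;
    for (char c : str) {
        int column = 0;                    // any other character
        if (c == '\\') column = 1;
        else if (c == '\'') column = 2;
        else if (c == '"') column = 3;
        else if (c == '#' && state == 0) break;  // comment starts here
        state = states[state][column];
        trace.push_back(state);
    }
    return trace;
}
```

For the eight-character input 'a\\' #c (a string ending in an escaped backslash), the trace is 1, 1, 3, 1, 0, 0: the second backslash is consumed by the escape state, so the following quote closes the string and the # is seen in state 0. With a single backslash, 'a\' #c, the quote is escaped instead, the string never closes, and the # is treated as string content.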
Upvotes: 4