killerbee13

Reputation: 61

remove line comments from string

I am writing a text parser that needs to be able to remove comments from lines. I am using a rather simple language in which every comment starts with a # character. It would be simple to remove everything after that, but I have to deal with the possibility that the # is inside a string.

My question, therefore, is: given a string such as
Value="String#1";"String#2"; # This is an array of "-delimited strings, "Like this"
how can I best extract the substring
Value="String#1";"String#2"; (note the trailing space)

Note that the comment may contain quotes. Also, a line may use either " or ' as its string delimiter, although the choice will be consistent over the entire line. This is known beforehand, if it is important. Quotes within the strings will be escaped by a \.

Upvotes: 4

Views: 593

Answers (1)

Paul Draper

Reputation: 83245

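#include <string>

// Returns the line with any trailing # comment removed.
// Takes a const reference so that str.begin() and it have the same iterator type.
std::string stripComment(const std::string& str) {
    bool escaped = false;        // previous character was a backslash inside a string
    bool inSingleQuote = false;  // inside a '...' string
    bool inDoubleQuote = false;  // inside a "..." string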
    for(std::string::const_iterator it = str.begin(); it != str.end(); it++) {
         if(escaped) {
             escaped = false;
         } else if(*it == '\\' && (inSingleQuote || inDoubleQuote)) {
             escaped = true;
         } else if(inSingleQuote) {
             if(*it == '\'') {
                 inSingleQuote = false;
             }
         } else if(inDoubleQuote) {
             if(*it == '"') {
                 inDoubleQuote = false;
             }
         } else if(*it == '\'') {
             inSingleQuote = true;
         } else if(*it == '"') {
             inDoubleQuote = true;
         } else if(*it == '#') {
             return std::string(str.begin(), it);
         }
    }
    return str;
}

EDIT: Or a more textbook FSM,

// Takes a const reference so that str.begin() and it have the same iterator type.
std::string stripComment(const std::string& str) {
    int states[5][4] = {
    //other \  '  "
        {0, 0, 1, 2,},  //outside any string
        {1, 3, 0, 1,},  //single quoted string
        {2, 4, 2, 0,},  //double quoted string
        {1, 1, 1, 1,},  //escape in single quoted string
        {2, 2, 2, 2,},  //escape in double quoted string
    };
    int state = 0;
    for(std::string::const_iterator it = str.begin(); it != str.end(); it++) {
        switch(*it) {
            case '\\':
                state = states[state][1];
                break;
            case '\'':
                state = states[state][2];
                break;
            case '"':
                state = states[state][3];
                break;
            case '#':
                if(!state) {
                    return std::string(str.begin(), it);
                }
                //intentional fall-through: a # inside a string is an ordinary character
            default:
                state = states[state][0];
        }
    }
    return str;
}

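The states array defines the transitions between the states of the FSM.

The first index is the current state: 0 (outside any string), 1 or 2 (inside a single- or double-quoted string), 3 or 4 (just after a backslash inside a single- or double-quoted string).

The second index corresponds to the character: 0 for any other character, 1 for \, 2 for ', and 3 for ".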

The array tells the next state, based on the current state and the character.
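To make the table easier to follow, here is a small sketch that gives the states and character classes names and records the state reached after each character. The names State, CharClass, classify and traceStates are made up for illustration; only the numbering and the transitions are taken from the table above. Unlike stripComment, it treats # as an ordinary character, so it only shows how the state moves.

```cpp
#include <string>

// Illustrative names only; the numeric values match the table in the answer.
enum State { Outside = 0, InSingle = 1, InDouble = 2, EscSingle = 3, EscDouble = 4 };
enum CharClass { Other = 0, Backslash = 1, SingleQuote = 2, DoubleQuote = 3 };

// Maps a character to the column of the transition table it selects.
CharClass classify(char c) {
    switch (c) {
        case '\\': return Backslash;
        case '\'': return SingleQuote;
        case '"':  return DoubleQuote;
        default:   return Other;
    }
}

// Returns one digit per input character: the state reached after reading it.
std::string traceStates(const std::string& text) {
    static const State transitions[5][4] = {
        // Other     Backslash  SingleQuote DoubleQuote
        {Outside,   Outside,   InSingle,   InDouble},  // outside any string
        {InSingle,  EscSingle, Outside,    InSingle},  // single quoted string
        {InDouble,  EscDouble, InDouble,   Outside},   // double quoted string
        {InSingle,  InSingle,  InSingle,   InSingle},  // escape in single quoted string
        {InDouble,  InDouble,  InDouble,   InDouble},  // escape in double quoted string
    };
    State state = Outside;
    std::string trace;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        state = transitions[state][classify(text[i])];
        trace += static_cast<char>('0' + state);
    }
    return trace;
}
```

For example, the seven characters "a\"#"# give the trace 2242200: the first # is read while in state 2 and is kept, while the last # is read while in state 0 and would start a comment.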

FYI, both versions assume that a backslash escapes any character inside a string. At a minimum, backslashes need to escape backslashes, so that a string can end with a backslash.
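Since the question says the delimiter for a given line is known beforehand, a shorter variant is possible. The following is only a sketch under that assumption: stripCommentKnownQuote is a hypothetical name, the quote parameter is the line's delimiter (" or '), and the other quote character is treated as ordinary text. It keeps the same assumption that a backslash escapes any character inside a string.

```cpp
#include <string>

// Hypothetical variant: the caller passes the line's quote character,
// which the question says is known in advance.
// Assumes a backslash escapes any character inside a string.
std::string stripCommentKnownQuote(const std::string& line, char quote) {
    bool inString = false;  // between an opening and a closing quote
    bool escaped = false;   // previous character was a backslash inside a string
    for (std::string::size_type i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (escaped) {
            escaped = false;           // this character is taken literally
        } else if (inString && c == '\\') {
            escaped = true;
        } else if (c == quote) {
            inString = !inString;      // opening or closing delimiter
        } else if (c == '#' && !inString) {
            return line.substr(0, i);  // keep everything before the comment
        }
    }
    return line;  // no comment on this line
}
```

Called with the line from the question and '"' as the quote character, this returns Value="String#1";"String#2"; including the trailing space.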

Upvotes: 4
