Component 10

Reputation: 10497

Extracting quoted and unquoted values using regex

I'm trying to parse a string of the form <tag>=<value> using regular expressions, but I've hit some issues adding support for quoted values. Any unquoted value should be trimmed of leading and trailing whitespace, so that [ Hello ] becomes [Hello] (please ignore the square brackets).

However, when the value is quoted, I want everything up to and including the double quotes to be removed but nothing further, so [ " Hello World " ] would become [ Hello World ], with the whitespace inside the quotes preserved.

So far, I've come up with the following code and pattern (note that some of the characters have been escaped or doubly escaped to avoid them being interpreted as trigraphs or other C escape sequences):

#include <iostream>
#include <string>
#include <boost/regex.hpp>

using namespace std;

void getTagVal( const std::string& tagVal )
{
    boost::smatch what;
    static const boost::regex pp("^\\s*([a-zA-Z0-9_-]+)\\s*=\\s*\"\?\?([%:\\a-zA-Z0-9 /\\._]+?)\"\?\?\\s*$");

    if ( boost::regex_match( tagVal, what, pp ) )
    {
        const string tag = static_cast<const string&>( what[1] );
        const string val = static_cast<const string&>( what[2] );

        cout << "Tag = [" << tag << "] Val = [" << val << "]" << endl;
    }
}

int main( int argc, char* argv[] )
{
    getTagVal("Qs1= \" Hello World \" ");
    getTagVal("Qs2=\" Hello World \" ");
    getTagVal("Qs3= \" Hello World \"");
    getTagVal("Qs4=\" Hello World \"");
    getTagVal("Qs5=\"Hello World \"");
    getTagVal("Qs6=\" Hello World\"");
    getTagVal("Qs7=\"Hello World\"");

    return 0;
}

Taking out the double escaping, the pattern is:

^\s*([a-zA-Z0-9_-]+)\s*=\s*"??([%:\a-zA-Z0-9 /\._]+?)"??\s*$

For the example calls in main(), I would expect to get:

Tag = [Qs1] Val = [ Hello World ]
Tag = [Qs2] Val = [ Hello World ]
Tag = [Qs3] Val = [ Hello World ]
Tag = [Qs4] Val = [ Hello World ]
Tag = [Qs5] Val = [Hello World ]
Tag = [Qs6] Val = [ Hello World]
Tag = [Qs7] Val = [Hello World]

but what I actually get is:

Tag = [Qs1] Val = [" Hello World ]
Tag = [Qs2] Val = [" Hello World ]
Tag = [Qs3] Val = [" Hello World ]
Tag = [Qs4] Val = [" Hello World ]
Tag = [Qs5] Val = ["Hello World ]
Tag = [Qs6] Val = [" Hello World]
Tag = [Qs7] Val = ["Hello World]

So it's almost correct, but for some reason the opening quote stays in the output value, even though the optional quotes sit outside the capture group for the value.

Upvotes: 0

Views: 1452

Answers (2)

Component 10

Reputation: 10497

Figured out what the problem was.

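When using \ you have to be careful: it is processed first by the compiler as part of the C string literal, so it needs to be escaped there, and then again by the regex engine. If you're not careful, \\a in the source becomes \a in the pattern, which the regex engine reads as the bell character (0x07). The class then contains the range \a-z, which includes the double quote (0x22), so the value group matches the opening quote itself. That is absolutely not what I wanted.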
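To make the two levels of escaping visible, here is a minimal sketch that only looks at what the compiler hands to the regex engine; it does not run any regex. The function names are hypothetical, the raw string literal assumes C++11 or later, and the range check assumes an ASCII-compatible character set.

```cpp
#include <string>

// The class as written in the question: the compiler turns \\ into one
// backslash, so the regex engine receives [%:\a-z] (8 characters).
std::string asWritten()  { return "[%:\\a-z]"; }

// The class with the backslash escaped twice: the regex engine receives
// [%:\\a-z] (9 characters), i.e. a literal backslash followed by a-z.
std::string asIntended() { return "[%:\\\\a-z]"; }

// Same text as asIntended(), written as a raw string literal so that no
// compiler-level escaping is involved.
std::string rawForm()    { return R"([%:\\a-z])"; }

// True when c lies in the range from the bell character (0x07) to 'z',
// which is what \a-z means once \a is read as the bell character.
bool inBellToZRange(char c) { return c >= '\a' && c <= 'z'; }
```

Here inBellToZRange('"') is true, which is why the opening quote could be consumed by the value group.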

So, to tell it that I want a literal \ in the set of characters allowed in the value (which I do because, ironically, they're used as escape sequences within a format string), the backslash has to be escaped twice, so

static const boost::regex pp("^\\s*([a-zA-Z0-9_-]+)\\s*=\\s*\"\?\?([%:\\a-zA-Z0-9 /\\._]+?)\"\?\?\\s*$");

becomes:

static const boost::regex pp("^\\s*([a-zA-Z0-9_-]+)\\s*=\\s*\"\?\?([%:\\\\a-zA-Z0-9 /._]+?)\"\?\?\\s*$");

(i.e. you need to make it \\\\)
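As a possible alternative to counting backslashes, the corrected pattern can be written as a raw string literal. The sketch below is an assumption-laden variant rather than the code from the question: it uses std::regex (ECMAScript grammar) instead of Boost.Regex so that it builds with the standard library alone, the function name parseTagVal is hypothetical, the optional quotes are written as "? rather than "??, and the dash in the tag class is escaped as a precaution.

```cpp
#include <regex>
#include <string>
#include <utility>

// Returns {tag, value}; both strings are empty when the input does not match.
// Assumption: std::regex stands in for Boost.Regex in this sketch.
std::pair<std::string, std::string> parseTagVal(const std::string& tagVal)
{
    // Raw string literal: the text between R"re( and )re" reaches the regex
    // engine unchanged, so \\ inside the class is one literal backslash.
    static const std::regex pp(
        R"re(^\s*([a-zA-Z0-9_\-]+)\s*=\s*"?([%:\\a-zA-Z0-9 /._]+?)"?\s*$)re");

    std::smatch what;
    if (!std::regex_match(tagVal, what, pp))
        return {};

    // Group 1 is the tag, group 2 the value without the surrounding quotes.
    return { what[1].str(), what[2].str() };
}
```

The custom delimiter re is needed because the pattern itself contains the sequence )" which would otherwise end the raw string early. Note that this pattern does not check that an opening quote is paired with a closing one.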

Upvotes: 0

FrankPl

Reputation: 13315

I would change the part starting with the first quote to an alternative:

"([^"]+)"|([%:\a-zA-Z0-9 /\._]+)\s*

The host code around the regex then has to handle both cases: quoted text ends up in the second capture group and unquoted text in the third.
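A minimal sketch of that host-side handling might look as follows. It is not the answerer's code: it assumes std::regex in place of Boost.Regex, wraps the alternative in a non-capturing group, makes the unquoted branch lazy so that trailing whitespace is left to \s*, and uses the literal-backslash form of the class. TagVal and parseWithAlternative are hypothetical names.

```cpp
#include <regex>
#include <string>

// Hypothetical result type for this sketch.
struct TagVal
{
    std::string tag;
    std::string val;
    bool quoted = false;  // true when the value came from the quoted branch
    bool ok = false;      // true when the input matched at all
};

TagVal parseWithAlternative(const std::string& input)
{
    // Group 1: tag, group 2: quoted value, group 3: unquoted value.
    static const std::regex pp(
        R"re(^\s*([a-zA-Z0-9_\-]+)\s*=\s*(?:"([^"]+)"|([%:\\a-zA-Z0-9 /._]+?))\s*$)re");

    TagVal result;
    std::smatch what;
    if (std::regex_match(input, what, pp))
    {
        result.ok = true;
        result.tag = what[1].str();
        // Only one of the two alternatives can have taken part in the match.
        result.quoted = what[2].matched;
        result.val = result.quoted ? what[2].str() : what[3].str();
    }
    return result;
}
```

One consequence of this design is that a quoted value may contain any character except a double quote, while an unquoted value is still limited to the original character class.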

Upvotes: 1
