Tagc

Reputation: 9072

Regex in C++ Not Working with Square Brackets

I'm trying to write regular expressions in C++ to validate XML files and extract the strings stored between tags.

This is one of the regular expressions I'm aiming for:

"<[^/]*?>"

This doesn't work, however. Neither does something simpler like this:

 "<[a-z]*>"

However, this produces a match:

 "<.*>"

It seems that patterns containing square brackets can't be matched.

Below is the relevant part of the code I'm using:

string testString = "<test>";

regex xmlRegOpenTag("<[^/]*?>", regex_constants::extended); 
smatch smOpen;
cout << regex_match(testString, smOpen, xmlRegOpenTag) << endl;

string openCap = smOpen[0];
cout << "openCap: " << openCap << endl;

I've tried using other flags like regex_constants::basic, etc. Nothing seems to be working. I'm compiling using gcc version 4.7.3.

To those mentioning that I shouldn't be parsing XML using regex: I only need to parse XML files that I've created myself, so it isn't a problem.

I'm using the C++11 standard. In my header file, I'm including regex as such:

#include <regex>
using namespace std;

When using the first regex ("<[^/]*?>"), I get:

terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error
Abort

When using the second regex ("<[a-z]*>"), I get:

0
openCap: 

When using the third regex ("<.*>"), I get:

1
openCap: <test>

This is the information I can provide about the compiler I'm using:

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.7/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu/Linaro 4.7.3-1ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.7/README.Bugs --enable-languages=c,c++,go,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.7 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.7 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --enable-plugin --with-system-zlib --enable-objc-gc --with-cloog --enable-cloog-backend=ppl --disable-cloog-version-check --disable-ppl-version-check --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.7.3 (Ubuntu/Linaro 4.7.3-1ubuntu1) 

Upvotes: 0

Views: 2419

Answers (3)

Joakim Bergkvist

Reputation: 21

I had the same problem. It appears that character-set matching (with square brackets) is broken in GCC 4.x with the default ECMAScript syntax; libstdc++ did not ship a complete <regex> implementation until GCC 4.9. Using the std::regex::extended parser seems to work, i.e.:

std::regex re1(".*", std::regex::ECMAScript);    // ok
std::regex re2("[a-z]", std::regex::ECMAScript); // throws regex_error
std::regex re3("[a-z]", std::regex::extended);   // ok

Upvotes: 2

Cu3PO42

Reputation: 1473

First of all, XML is not a regular language, and you shouldn't try to parse it with regexes; eventually that will give you some real headaches. You should rather use one of the available XML parsers. For example, given something like "<foo><bar /></foo>", a pattern such as <.*> will match the whole string and not just the first tag. You can try 'lazy' matching with <.*?>, which matches as few characters as possible, but that might still break if you have a > inside an attribute value, for example.

Now, setting aside the problems with parsing XML using regexes: all the regexes you gave should match <test>, and they do in the implementations I tried. That suggests a bug either in your code or in the library you use. I don't see one in your code, and the standard library's regex implementation shouldn't be buggy either...

EDIT: I just tried this in C++ and the regexes work there as well. In a minimal implementation,

regex reg("<[^/]*>");
if (regex_match("<test>", reg))
    cout << "Matched..." << endl;
else
    cout << "Didn't match..." << endl;

yields the output "Matched..." - and <[a-z]*> works as well. I used clang-500.2.79 in this experiment. This basically confirms that the regex implementation supplied with your compiler is faulty.

Upvotes: 2

Rakesh KR

Reputation: 6527

The regexes you tried break down as follows:

[^/]* matches any character except '/', zero or more times (as many as possible)

[a-z]* matches any character from 'a' to 'z', zero or more times (as many as possible)

.* matches any character, zero or more times (as many as possible)

Upvotes: 0
