HellSpam

Regular Expression engine that supports raw UTF-8?

I need a regular expression engine that supports raw UTF-8, meaning the string is stored in a char * buffer with each character encoded as one or more bytes. For example, "Ab" is the array {0x41, 0x62}. Does anyone know of a regex engine that accepts that format? I can convert to wchar_t first if needed.

Upvotes: 0

Views: 917

Answers (3)

lothar

Reputation: 20229

UTF-8 is a variable-length encoding, which makes it very hard to write algorithms such as regex matching directly on the raw bytes.

It's better to convert the UTF-8 string to a Unicode wstring with ICU and then use the wstring variant of boost::regex.
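
The convert-then-match idea can be sketched with the standard library alone. Note the substitutions: this uses a hand-written decoder in place of ICU and std::wregex in place of boost::regex, so it illustrates the approach rather than the exact libraries named above. The helper names utf8_to_wstring and matches_utf8 are hypothetical, and the sketch assumes a 32-bit wchar_t (as on Linux) and does not reject overlong or surrogate encodings.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into one wchar_t per code point.
// Assumes wchar_t is wide enough to hold any code point (32-bit).
std::wstring utf8_to_wstring(const std::string& in) {
    std::wstring out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        unsigned long cp = 0;
        if (lead < 0x80)              { len = 1; cp = lead; }
        else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1F; }
        else if ((lead >> 4) == 0x0E) { len = 3; cp = lead & 0x0F; }
        else if ((lead >> 3) == 0x1E) { len = 4; cp = lead & 0x07; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size()) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(static_cast<wchar_t>(cp));
        i += len;
    }
    return out;
}

// Hypothetical helper: whole-string match of a wide pattern against UTF-8 input.
bool matches_utf8(const std::string& utf8, const std::wstring& pattern) {
    return std::regex_match(utf8_to_wstring(utf8), std::wregex(pattern));
}
```

With this, matches_utf8("h\xC3\xA9llo", L"h.llo") succeeds because "é" has become a single wchar_t, whereas the same pattern run through a narrow std::regex on the raw bytes fails, since "." consumes only one of the two bytes of "é".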

Upvotes: 0

Benoît

Reputation: 16994

This page says that it is possible with Boost.Regex, provided that you build it with, and use, the ICU library.

Upvotes: 2

majkinetor

Reputation: 9036

The current implementation of PCRE (release 7.x) corresponds approximately with Perl 5.10, including support for UTF-8 encoded strings and Unicode general category properties. However, UTF-8 and Unicode support has to be explicitly enabled; it is not the default. The Unicode tables correspond to Unicode release 5.1.

Upvotes: 0
