Reputation: 2246
I am in the process of converting a regex library (thousands of Perl regexes) and have come across a major problem.
This is the expression that I have to translate into static xpressive:
(?<![A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝ]\. )[mM]\.(?! [A-ZÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝ]\. )
This expression has negated before and after conditions (a negative look-behind and a negative look-ahead), which means that normally I should use ~after and ~before.
However, as there are multibyte characters, I have to put them in as string literals.
My initial attempt was therefore like this:
~after(range('A', 'Z')| as_xpr("À")| as_xpr("Á")| as_xpr("Â")| as_xpr("Ã")| as_xpr("Ä")|
as_xpr("Å")| as_xpr("Ç")| as_xpr("È")| as_xpr("É")| as_xpr("Ê")| as_xpr("Ë")|
as_xpr("Ì")| as_xpr("Í")| as_xpr("Î")| as_xpr("Ï")| as_xpr("Ñ")| as_xpr("Ò")|
as_xpr("Ó")| as_xpr("Ô")| as_xpr("Õ")| as_xpr("Ö")| as_xpr("Ø")| as_xpr("Ù")|
as_xpr("Ú")| as_xpr("Û")| as_xpr("Ü")| as_xpr("Ý") | as_xpr(". ") ) >>
(set= 'm', 'M') >> '.' >>
~before(range('A', 'Z')| as_xpr("À")| as_xpr("Á")| as_xpr("Â")| as_xpr("Ã")| as_xpr("Ä")|
as_xpr("Å")| as_xpr("Ç")| as_xpr("È")| as_xpr("É")| as_xpr("Ê")| as_xpr("Ë")|
as_xpr("Ì")| as_xpr("Í")| as_xpr("Î")| as_xpr("Ï")| as_xpr("Ñ")| as_xpr("Ò")|
as_xpr("Ó")| as_xpr("Ô")| as_xpr("Õ")| as_xpr("Ö")| as_xpr("Ø")| as_xpr("Ù")|
as_xpr("Ú")| as_xpr("Û")| as_xpr("Ü")| as_xpr("Ý") | as_xpr(". ") )
However, as the alternatives match a variable number of chars (the accented letters are multibyte), it will not compile.
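To illustrate the width problem, here is a minimal sketch. It assumes the pattern is built over plain char strings holding UTF-8, so each accented letter is seen as two chars rather than one. The helper byte_width and the constant names are hypothetical, introduced only for this illustration; they are not part of xpressive.

```cpp
#include <cstddef>

// Hypothetical helper: counts the chars (bytes) in a narrow string,
// i.e. the width a char-based regex would see for that literal.
constexpr std::size_t byte_width(const char* s) {
    return *s == '\0' ? 0 : 1 + byte_width(s + 1);
}

// UTF-8 bytes are spelled out so the result does not depend on
// the encoding of the source file.
constexpr const char* kAsciiA = "A";         // U+0041, one byte
constexpr const char* kAGrave = "\xC3\x80";  // U+00C0 (À), two bytes
constexpr const char* kDotSpace = ". ";      // two bytes

// The two kinds of alternative have different total widths, so a
// single look-behind over their alternation is not fixed-width.
static_assert(byte_width(kAsciiA) + byte_width(kDotSpace) == 3,
              "ASCII letter followed by \". \" is 3 chars wide");
static_assert(byte_width(kAGrave) + byte_width(kDotSpace) == 4,
              "accented letter followed by \". \" is 4 chars wide");
```

If both static_asserts pass, the ASCII branch and the accented branch differ by one char, which is consistent with the look-behind being rejected as variable-width.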
Is there any way that I can implement this regex correctly in static xpressive?
Upvotes: 0
Views: 49
Reputation: 2246
The solution, rather ugly, involves multiple before and after elements.
Here is the solution, which has been tested and works:
(~after(range('A', 'Z') >> as_xpr(". ")) >>
~after(as_xpr("À") >> as_xpr(". ")) >> ~after(as_xpr("Á") >> as_xpr(". ")) >> ~after(as_xpr("Â") >> as_xpr(". ")) >>
~after(as_xpr("Ã") >> as_xpr(". ")) >> ~after(as_xpr("Ä") >> as_xpr(". ")) >> ~after(as_xpr("Å") >> as_xpr(". ")) >>
~after(as_xpr("Ç") >> as_xpr(". ")) >> ~after(as_xpr("È") >> as_xpr(". ")) >> ~after(as_xpr("É") >> as_xpr(". ")) >>
~after(as_xpr("Ê") >> as_xpr(". ")) >> ~after(as_xpr("Ë") >> as_xpr(". ")) >> ~after(as_xpr("Ì") >> as_xpr(". ")) >>
~after(as_xpr("Í") >> as_xpr(". ")) >> ~after(as_xpr("Î") >> as_xpr(". ")) >> ~after(as_xpr("Ï") >> as_xpr(". ")) >>
~after(as_xpr("Ñ") >> as_xpr(". ")) >> ~after(as_xpr("Ò") >> as_xpr(". ")) >> ~after(as_xpr("Ó") >> as_xpr(". ")) >>
~after(as_xpr("Ô") >> as_xpr(". ")) >> ~after(as_xpr("Õ") >> as_xpr(". ")) >> ~after(as_xpr("Ö") >> as_xpr(". ")) >>
~after(as_xpr("Ø") >> as_xpr(". ")) >> ~after(as_xpr("Ù") >> as_xpr(". ")) >> ~after(as_xpr("Ú") >> as_xpr(". ")) >>
~after(as_xpr("Û") >> as_xpr(". ")) >> ~after(as_xpr("Ü") >> as_xpr(". ")) >> ~after(as_xpr("Ý") >> as_xpr(". ")) ) >>
(boost::xpressive::set= 'm', 'M') >> '.' >>
(~before(as_xpr(" ") >> range('A', 'Z') >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("À") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Á") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Â") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Ã") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Ä") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Å") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Ç") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("È") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("É") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Ê") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Ë") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Ì") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Í") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Î") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Ï") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Ñ") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Ò") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Ó") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Ô") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Õ") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Ö") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Ø") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Ù") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Ú") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Û") >> as_xpr(". ")) >> ~before(as_xpr(" ") >> as_xpr("Ü") >> as_xpr(". ")) >>
~before(as_xpr(" ") >> as_xpr("Ý") >> as_xpr(". ")) )
As said before, it is ugly, but it overcomes the limitation that xpressive does not work with UTF-8 characters natively.
Upvotes: 0