LOlliffe
LOlliffe

Reputation: 1695

How do I strip accents from characters in XSL?

I keep looking, but can't find an XSL function that is the equivalent of "normalize-space", for characters. That is, my content has accented UNICODE characters, which is great, but from that content, I'm creating a filename, where I don't want those accents.

So, is there something that I'm overlooking, or not googling properly, to easily process characters?

In the XML data:

<filename>gri_gonéwiththèw00mitc</filename>

In XSLT stylesheet:

<xsl:variable name="file">
    <xsl:value-of select="filename"/>
</xsl:variable>

<xsl:value-of select="$file"/>

results in "gri_gonéwiththèw00mitc"

where

<xsl:value-of select='replace( normalize-unicode( "$file", "NFKD" ), "[^\\p{ASCII}]", "" )'/>

results in nothing.

What I'm aiming for is gri_gonewiththew00mitc (no accents)

Am I using the syntax wrong?

Upvotes: 6

Views: 14384

Answers (4)

hmedz
hmedz

Reputation: 1

The top voted answer does not work anymore XPath2.0, as mentioned by Yuri. The 'IsBasicLatin' is an appropriate substitution for ASCII

The following code works:

replace(normalize-unicode('çgri_gonéwiththèmitç','NFKD'),'\P{IsBasicLatin}','')

Upvotes: 0

Yuri
Yuri

Reputation: 4498

The previously suggested ways contain unknownthe character class named 'ASCII'. In my experience, XPath 2.0 recognises the class 'BasicLatin', which should serve the same purpose as 'ASCII'.

replace(normalize-unicode('Lliç d'Am Oükl Úkřeč', 'NFKD'), '\P{IsBasicLatin}', '')

Upvotes: 2

user357812
user357812

Reputation:

In XSLT/XPath 1.0 if you want to replace those accented characters with the unaccented counterpart, you could use translate() function.

But, that assumes your "accented UNICODE characters" aren't composed unicode characters. If that were the case, you would need to use XPath 2.0 normalize-unicode() function.

And, if the real goal is to have a valid URI, you should use encode-for-uri()

Update: Examples

translate('gri_gonéwiththèw00mitc','áàâäéèêëíìîïóòôöúùûü','aaaaeeeeiiiioooouuuu')

Result: gri_gonewiththew00mitc

encode-for-uri('gri_gonéwiththèw00mitc')

Result: gri_gon%C3%A9withth%C3%A8w00mitc

Correct expression provide suggest by @biziclop:

replace(normalize-unicode('gri_gonéwiththèw00mitc','NFKD'),'\P{ASCII}','')

Result: gri_gonewiththew00mitc

Note: In XPath 2.0, the correct character class negation is with a capital \P.

Upvotes: 9

biziclop
biziclop

Reputation: 49804

So, contrary to my comment, you could try this:

replace( normalize-unicode( "öt hűtőházból kértünk színhúst", "NFKD" ), "[^\\p{ASCII}]", "" )

Although be warned that any characters which can't be decomposed and aren't basic ASCII (Norwegian ø or Icelandic Þ for example) will be completely deleted from the string, but that's probably okay with your requirements.

Upvotes: 3

Related Questions