Mohamad

Reputation: 35349

Regex removing extra characters

I'm using this pattern [^a-z0-9+\ ,#\-.] to filter tags before saving them to my database.

It works, but with an undesired side effect: it strips accented characters, so instalação becomes instalao.

Any idea how I can keep accents intact while sticking to the pattern?

I'm using ColdFusion, so I assume it's based on Java Regex, but I could be wrong.

My intention is to allow letters (including accented ones), the digits 0 to 9, dots and hashes.

Upvotes: 1

Views: 361

Answers (3)

Valadas

Reputation: 121

Use

[^\w]

\w matches any word character, so the negated class [^\w] matches every non-word character. Alternatively, use

\W

to match all non-word characters.

Upvotes: 2

Bart Kiers

Reputation: 170178

According to the documentation, \w matches any (Unicode) letter or digit, but also underscores. If you don't want underscores, then you can do this:

[^[:alpha:]0-9#.-]

where [:alpha:] matches any (Unicode) letter. If you want to match digits outside the 0-9 range, try:

[^[:alnum:]##.-]

Note the extra hash: it escapes the # character, which ColdFusion treats specially. Without it you would get a malformed tag/variable error.
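For comparison, here is a minimal sketch of the same idea in plain Java rather than CFML. It assumes java.util.regex, which does not accept the [:alpha:] bracket syntax, so \p{L} (any Unicode letter) is swapped in. The class and method names are hypothetical, and the remaining allowed characters are copied from the pattern in the question. Unlike the question's a-z range, \p{L} also keeps uppercase letters.

```java
import java.util.regex.Pattern;

class TagFilter {
    // Hypothetical helper, not part of the original answer.
    // \p{L} matches any Unicode letter, so accented letters survive.
    // The rest of the class mirrors the question's pattern:
    // digits 0-9, plus, space, comma, hash, hyphen and dot.
    private static final Pattern DISALLOWED =
            Pattern.compile("[^\\p{L}0-9+ ,#\\-.]");

    // Remove every character that is not in the allowed set.
    static String clean(String tag) {
        return DISALLOWED.matcher(tag).replaceAll("");
    }

    public static void main(String[] args) {
        // "instalacao" written with c-cedilla and a-tilde, plus a stray "!"
        String tag = "instala\u00e7\u00e3o!";
        System.out.println(clean(tag));
    }
}
```

With this pattern only the trailing "!" is removed and the accented letters are kept. From ColdFusion you would still need to double the hash wherever the pattern appears inside a CFML string.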

Upvotes: 5

lynks

Reputation: 5689

Have you tried the character classes? \w matches letters, numbers and underscore, and may just match accented characters, although I don't know for sure.
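If the engine really is java.util.regex (an assumption, since ColdFusion's own REReplace may behave differently), the answer depends on a flag: by default \w only covers ASCII letters, digits and underscore, while Pattern.UNICODE_CHARACTER_CLASS (Java 7 and later) extends it to Unicode letters. A small sketch with hypothetical class and method names:

```java
import java.util.regex.Pattern;

class WordClassCheck {
    // Default behaviour: \w is ASCII-only, i.e. [a-zA-Z_0-9].
    static final Pattern ASCII_NON_WORD = Pattern.compile("\\W");

    // With UNICODE_CHARACTER_CLASS, \w follows the Unicode definition,
    // so accented letters count as word characters.
    static final Pattern UNICODE_NON_WORD =
            Pattern.compile("\\W", Pattern.UNICODE_CHARACTER_CLASS);

    // Strip non-word characters using the ASCII-only definition.
    static String stripAscii(String s) {
        return ASCII_NON_WORD.matcher(s).replaceAll("");
    }

    // Strip non-word characters using the Unicode-aware definition.
    static String stripUnicode(String s) {
        return UNICODE_NON_WORD.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // "instalacao" written with c-cedilla and a-tilde
        String tag = "instala\u00e7\u00e3o";
        System.out.println(stripAscii(tag));   // accented letters are dropped
        System.out.println(stripUnicode(tag)); // accented letters are kept
    }
}
```

Either way, \w also lets underscores through, which may not be wanted for tags.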

Upvotes: 2
