wazy

Reputation: 1065

How to replace all unicode characters except for Spanish ones?

I am trying to remove all Unicode characters from a file except for the Spanish characters.

Matching the accented vowels has not been an issue: áéíóúÁÉÍÓÚ are kept by the following regex, and all other Unicode characters appear to be removed:

perl -pe 's/[^áéíóúÁÉÍÓÚ[:ascii:]]//g;' filename

But when I add the inverted question mark ¿ or the inverted exclamation mark ¡ to the regex, other Unicode characters that I would like removed are also matched and kept:

perl -pe 's/[^áéíóúÁÉÍÓÚ¡¿[:ascii:]]//g;' filename

This does not remove the following characters (some are not printable): ³ � � ­

Am I missing something obvious here? I am also open to other ways of doing this on the terminal.

Upvotes: 3

Views: 347

Answers (1)

Wiktor Stribiżew

Reputation: 627082

You have a UTF-8 encoded file and are working with Unicode characters, so you need to pass a specific set of options to tell Perl about that.

You should add -Mutf8 so that Perl recognizes the UTF-8-encoded characters used directly in your Perl code.

You also need to pass -CSD (equivalent to -CIOED) so that your input is decoded and your output is re-encoded. This option assumes UTF-8, so it works only when the data is UTF-8 encoded.

perl -CSD -Mutf8 -pe 's/[^áéíóúñüÁÉÍÓÚÑÜ¡¿[:ascii:]]//g;' filename

Do not forget about ñ, Ñ, ü, and Ü, which are also Spanish letters; they are included in the character class above.

Upvotes: 1
