Reputation: 1065
I am trying to remove all Unicode characters from a file except for the Spanish characters.
Matching the accented vowels has not been an issue: áéíóúÁÉÍÓÚ are not replaced by the following regex (but all other Unicode appears to be replaced):
perl -pe 's/[^áéíóúÁÉÍÓÚ[:ascii:]]//g;' filename
But when I add the inverted question mark ¿ or exclamation mark ¡ to the regex, other Unicode characters that I would like removed are also matched and kept:
perl -pe 's/[^áéíóúÁÉÍÓÚ¡¿[:ascii:]]//g;' filename
does not replace the following (some are not printable):
³ � �
Am I missing something obvious here? I am also open to other ways of doing this on the terminal.
Upvotes: 3
Views: 347
Reputation: 627082
You have a UTF-8 encoded file and are working with Unicode characters, so you need to pass a specific set of options to let Perl know that.
You should add -Mutf8 to let Perl recognize the UTF-8 encoded characters used directly in your Perl code.
Also, you need to pass -CSD (equivalent to -CIOED) in order to have your input decoded and your output re-encoded. This option is encoding dependent: it works for UTF-8.
perl -CSD -Mutf8 -pe 's/[^áéíóúñüÁÉÍÓÚÑÜ¡¿[:ascii:]]//g;' filename
Do not forget about Ü and ü.
Upvotes: 1