So I have this string I want to remove non-alphanumeric characters from:
my_string = "¿Habla usted Inglés, por favor?"
Basically I want to get rid of the ?, ¿ and , in this case. I then split the words into a list and do various kickass things with each one.
I am using
String.replace(my_string, my_regex, "")
String.split(" ")
to do the work. I have two different regex strings I'm attempting to use:
my_regex = ~r/[\_\.,:;\?¿¡\!&@$%\^]/
my_regex = ~r/[[:punct:]]/
The first one works like a charm. I end up with:
["habla", "usted", "inglés"]
The second one removes the correct characters but I end up with:
[<<194, 104, 97, 98, 108, 97>>, "usted", <<105, 110, 103, 108, 195, 115>>]
At first I thought the strange output was just because of the non-ascii alphas being dumped to the console. But when I attempt to match with the expected list of strings it fails.
Whatever the case, I just don't understand why the two different regex result in different output in terms of the strings in the list.
Here is code that can be run in iex to succinctly reproduce my issue:
a = ~r/[\_\.,:;\?¿¡\!&@$%\^]/
b = ~r/[[:punct:]]/
y = "¿Habla usted Inglés, por favor?"
String.replace(y, a, "")
# -> "Habla usted Inglés por favor"
String.replace(y, b, "")
# -> <<194, 72, 97, 98, 108, 97, 32, 117, 115, 116, 101, 100, 32, 73, 110, 103, 108, 195, 115, 32, 112, 111, 114, 32, 102, 97, 118, 111, 114>>
Upvotes: 2
Views: 752
Reputation: 23556
While Dean Taylor described how to make it work, I will describe why the output was what it was.
First of all, when computing started we needed some way to translate letters into numbers, some uniform standard everyone could use; skip a lot of history and we ended up with the American Standard Code for Information Interchange, known as ASCII. ASCII is a 7-bit encoding, which means the highest bit of each byte is always set to 0 when working with ASCII. The problem with ASCII is that it is very English-centric: it contains only the 26 basic Latin letters and does not support diacritics from other languages. From this need came the idea: just use that top bit and gain another 128 codes.
So now we had a solution, but another problem quickly arose: many more letters were needed, and the question was how to fit them. The first, and at the time simplest, solution was something known as "code pages": tables describing how to interpret the codes with the top bit set. So we ended up with a lot of codepages for different parts of the world.
So far so good.
Unless not. Codepages had a big flaw: only one codepage could be used at a time in a single document. So, for example, you couldn't have Danish (ISO-8859-1) and Russian (ISO-8859-5) letters in the same document, as each set of characters used the same codes for different characters. For example, Øи would be impossible, as both letters occupy exactly the same code in their respective codepages. Whoops…
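The clash can be seen from the numbers themselves. A small sketch (the shared 0xD8 byte value is taken from the ISO-8859-1 and ISO-8859-5 tables; Unicode codepoints shown via Elixir's `?` syntax):

```elixir
# In ISO-8859-1 the byte 0xD8 is Ø; in ISO-8859-5 the very same byte is и,
# so a single byte stream cannot carry both at once.
# Unicode assigns each letter its own codepoint instead:
IO.inspect(?Ø, base: :hex)  # 0xD8  — same value Latin-1 happened to use
IO.inspect(?и, base: :hex)  # 0x438 — no longer clashes with Ø
```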
So after that came Unicode, which wanted to fix that whole mess. In Unicode each letter has an assigned codepoint, but be wary: these codepoints aren't the bytes that get dumped into the file. They need to be encoded in some way. The most popular encodings nowadays are UTF-8, UTF-16, and UTF-32.
Ok, so now we know how to encode characters. But there is one thing more: to simplify conversion (and due to a highly western-centric committee), the first 256 codepoints of Unicode are identical to the ISO-8859-1 code page.
Now we are close to the solution of the mystery.
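The codepoint/byte distinction is easy to see in Elixir, where strings are UTF-8 binaries:

```elixir
string = "¿"
String.length(string)        # => 1 — one codepoint (U+00BF)
byte_size(string)            # => 2 — two bytes in UTF-8
:binary.bin_to_list(string)  # => [194, 191] — the C2 BF byte pair
```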
Erlang (which is older than Unicode by at least 5 years) was developed in Sweden by Ericsson, which means they naturally picked the codepage that was natural there: ISO-8859-1. This codepage also contains Spanish characters like ¿, which is encoded as BF hex (191 dec). By the rules above, in UTF-8 this character is encoded as the bytes C2 BF in the binary. But your regex did not state that it wanted to use Unicode character groups, so Erlang assumed you wanted the default ISO-8859-1 codepage, where the byte BF is punctuation. That is why that character was removed from the original string.
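A minimal sketch of this byte-level behaviour (the result matches the first byte, 194, of the garbled output in the question):

```elixir
# Without the u flag the regex engine works on raw bytes, so only the BF
# byte of "¿" (encoded as C2 BF) matches [[:punct:]] and is removed:
String.replace("¿", ~r/[[:punct:]]/, "")
# => <<194>> — the stray C2 byte, no longer valid UTF-8
```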
As for why the first version worked: since Elixir uses UTF-8 binaries to store strings, your regex didn't match ¿ itself but rather each of its bytes C2 and BF separately, as it was internally converted to the equivalent of ~r/[\xC2\xBF]/, which is a perfectly valid regex. This is also why the letter é ended up mangled: it is encoded as C3 A9, and A9 in that codepage means © (which is also treated as punctuation). That means you end up with 2 binaries that aren't valid UTF-8 strings, and Elixir's inspect will not try to present them as strings.
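You can check this yourself: `String.valid?/1` reports whether a binary is valid UTF-8, which is what decides how `inspect` renders it:

```elixir
String.valid?("habla")                        # => true  — inspected as "habla"
String.valid?(<<194, 104, 97, 98, 108, 97>>)  # => false — the leftover C2 byte breaks it
IO.inspect(<<194, 104, 97, 98, 108, 97>>)
# prints <<194, 104, 97, 98, 108, 97>> instead of a double-quoted string
```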
Upvotes: 4
Reputation: 121000
If you want to remove non-alphanumeric characters, you should indeed remove non-alphanumeric characters (and probably non-spaces), not [:punct:].
"¿Habla usted Inglés, por favor?"
|> String.replace(~r/[^[:alnum:]\s]+/u, "")
#⇒ "Habla usted Inglés por favor"
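Continuing on to the word list the question was after (a sketch; `String.downcase/1` is added here because the expected list is lowercase):

```elixir
"¿Habla usted Inglés, por favor?"
|> String.replace(~r/[^[:alnum:]\s]+/u, "")  # drop everything but letters, digits, spaces
|> String.downcase()
|> String.split()
# => ["habla", "usted", "inglés"]
```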
Upvotes: 1
Reputation: 41981
Include the Unicode u flag to get Unicode support.
e.g.
a = ~r/[\_\.,:;\?¿¡\!&@$%\^]/u
b = ~r/[[:punct:]]/u
Can be seen running here: https://ideone.com/0nQKlq
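With the u flag, the [[:punct:]] version now produces the same cleaned string as the first regex in the question:

```elixir
y = "¿Habla usted Inglés, por favor?"
String.replace(y, ~r/[[:punct:]]/u, "")
# => "Habla usted Inglés por favor"
```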
Upvotes: 4