So I have this string I want to remove non-alphanumeric characters from:
my_string = "¿Habla usted Inglés, por favor?"
Basically I want to get rid of the ?, ¿ and , in this case. I then split the words into a list and do various kickass things with each one.
I am using
String.replace(my_string, my_regex, "")
String.split(" ")
to do the work. I have two different regex strings I'm attempting to use:
my_regex = ~r/[\_\.,:;\?¿¡\!&@$%\^]/
my_regex = ~r/[[:punct:]]/
The first one works like a charm. I end up with:
["habla", "usted", "inglés"]
The second one removes the correct characters but I end up with:
[<<194, 104, 97, 98, 108, 97>>, "usted", <<105, 110, 103, 108, 195, 115>>]
At first I thought the strange output was just because of the non-ascii alphas being dumped to the console. But when I attempt to match with the expected list of strings it fails.
Whatever the case, I just don't understand why the two different regex result in different output in terms of the strings in the list.
Here is code that can be run in iex to succinctly reproduce my issue:
a = ~r/[\_\.,:;\?¿¡\!&@$%\^]/
b = ~r/[[:punct:]]/
y = "¿Habla usted Inglés, por favor?"
String.replace(y, a, "")
# -> "Habla usted Inglés por favor"
String.replace(y, b, "")
# -> <<194, 72, 97, 98, 108, 97, 32, 117, 115, 116, 101, 100, 32, 73, 110, 103, 108, 195, 115, 32, 112, 111, 114, 32, 102, 97, 118, 111, 114>>
Upvotes: 2
Views: 752
Reputation: 23556
While Dean Taylor described how to make it work, I will describe why the output was what it was.
First of all, when computing started we needed some way to translate letters into numbers, some uniform standard everyone could use; skip a lot of history and we ended up with the American Standard Code for Information Interchange, known as ASCII. ASCII is a 7-bit encoding, which means the highest bit of each byte is always set to 0 when working with ASCII. The problem with ASCII is that it is very English-centric: it contains only the 26 basic Latin letters and does not support diacritics from other languages. From this need came the idea: just use that top bit and gain another 128 codes.
So now we had a solution, but another problem quickly arose: many more letters were needed, and the question was how to fit them. The first, and at the time simplest, solution was something known as "code pages": tables describing how to interpret the codes with the top bit set. So we ended up with a lot of codepages for different parts of the world.
So far so good.
Unless not. Codepages had a big flaw: only one codepage could be used at a time in a single document. So, for example, you couldn't have Danish (ISO-8859-1) and Russian (ISO-8859-5) letters in the same document, as each set of characters used the same codes for different characters. For example, Øи would be impossible, as both letters occupy exactly the same code in their respective codepages. Whoops…
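The clash can be seen from the numbers themselves. A small sketch (the shared 0xD8 byte value is taken from the ISO-8859-1 and ISO-8859-5 tables; Unicode codepoints shown via Elixir's `?` syntax):

```elixir
# In ISO-8859-1 the byte 0xD8 is Ø; in ISO-8859-5 the very same byte is и,
# so a single byte stream cannot carry both at once.
# Unicode assigns each letter its own codepoint instead:
IO.inspect(?Ø, base: :hex)  # 0xD8  — same value Latin-1 happened to use
IO.inspect(?и, base: :hex)  # 0x438 — no longer clashes with Ø
```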
So after that came Unicode, which wanted to fix that whole mess. In Unicode each letter has an assigned codepoint, but be wary: these codepoints aren't the bytes that get dumped into the file. They need to be encoded in some way. The most popular encodings nowadays are UTF-8, UTF-16, and UTF-32.
Ok, so now we know how to encode characters. But there is one thing more: to simplify conversion (and due to a highly western-centric committee), the first 256 codepoints of Unicode are identical to the ISO-8859-1 code page.
Now we are close to the solution of the mystery.
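The codepoint/byte distinction is easy to see in Elixir, where strings are UTF-8 binaries:

```elixir
string = "¿"
String.length(string)        # => 1 — one codepoint (U+00BF)
byte_size(string)            # => 2 — two bytes in UTF-8
:binary.bin_to_list(string)  # => [194, 191] — the C2 BF byte pair
```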
Erlang (which is older than Unicode by at least 5 years) was developed in Sweden by Ericsson, which means they naturally picked the codepage that was natural there: ISO-8859-1. This codepage also contains Spanish characters like ¿, which is encoded as BF hex (191 dec). By the rules above, in UTF-8 this character is encoded as the bytes C2 BF in the binary. But your regex did not state that it wanted to use Unicode character groups, so Erlang assumed you wanted the default ISO-8859-1 codepage, where the byte BF is punctuation. That is why that character was removed from the original string.
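A minimal sketch of this byte-level behaviour (the result matches the first byte, 194, of the garbled output in the question):

```elixir
# Without the u flag the regex engine works on raw bytes, so only the BF
# byte of "¿" (encoded as C2 BF) matches [[:punct:]] and is removed:
String.replace("¿", ~r/[[:punct:]]/, "")
# => <<194>> — the stray C2 byte, no longer valid UTF-8
```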
As for why the first version worked: since Elixir uses UTF-8 binaries to store strings, your regex didn't match ¿ itself but rather each of its bytes C2 and BF separately, as it was internally converted to the equivalent of ~r/[\xC2\xBF]/, which is a perfectly valid regex. This is also why the letter é ended up mangled: it is encoded as C3 A9, and A9 in that codepage means © (which is also treated as punctuation). That means you end up with 2 binaries that aren't valid UTF-8 strings, and Elixir's inspect will not try to present them as strings.
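You can check this yourself: `String.valid?/1` reports whether a binary is valid UTF-8, which is what decides how `inspect` renders it:

```elixir
String.valid?("habla")                        # => true  — inspected as "habla"
String.valid?(<<194, 104, 97, 98, 108, 97>>)  # => false — the leftover C2 byte breaks it
IO.inspect(<<194, 104, 97, 98, 108, 97>>)
# prints <<194, 104, 97, 98, 108, 97>> instead of a double-quoted string
```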
Upvotes: 4
Reputation: 121000
If you want to remove non-alphanumeric characters, you should indeed remove non-alphanumeric characters (and probably non-spaces), not [:punct:].
"¿Habla usted Inglés, por favor?"
|> String.replace(~r/[^[:alnum:]\s]+/u, "")
#⇒ "Habla usted Inglés por favor"
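Continuing on to the word list the question was after (a sketch; `String.downcase/1` is added here because the expected list is lowercase):

```elixir
"¿Habla usted Inglés, por favor?"
|> String.replace(~r/[^[:alnum:]\s]+/u, "")  # drop everything but letters, digits, spaces
|> String.downcase()
|> String.split()
# => ["habla", "usted", "inglés"]
```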
Upvotes: 1
Reputation: 41981
Include the Unicode u flag to get Unicode support.
e.g.
a = ~r/[\_\.,:;\?¿¡\!&@$%\^]/u
b = ~r/[[:punct:]]/u
Can be seen running here: https://ideone.com/0nQKlq
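With the u flag, the [[:punct:]] version now produces the same cleaned string as the first regex in the question:

```elixir
y = "¿Habla usted Inglés, por favor?"
String.replace(y, ~r/[[:punct:]]/u, "")
# => "Habla usted Inglés por favor"
```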
Upvotes: 4