Gautam Kumar

Reputation: 53

How to remove ellipsis at the end of Strings in R

I have a list of words, which I got from the code below.

tags_vector <- unlist(tags_used)

Some of the strings in this list have an ellipsis at the end, which I want to remove. Here I print the 5th element of the list and its class:

tags_vector[5]
#[1] "#b…"

class(tags_vector[5])
#[1] "character"

I am trying to remove the ellipsis from this 5th element with gsub, using the following code:

gsub("[…]", "", tags_vector[5])
#[1] "#b…"

This code doesn't work, and I still get "#b…" as output. But when I put the value of the 5th element directly into the same code, it works fine, as shown below:

gsub("[…]", "", "#b…")
#[1] "#b"

I even tried storing the value of tags_vector[5] in a variable x1 and passing that to gsub(), but it still didn't work.

Upvotes: 3

Views: 1353

Answers (1)

takje

Reputation: 2800

It might be a Unicode issue. In R(studio), not all characters are created equally.

I tried to create a reproducible example:

# create the ellipsis from the definition (similar to your tags_used)
> ell_def <- rawToChar(as.raw(c('0xE2','0x80','0xA6'))) # from the unicode definition here: http://www.fileformat.info/info/unicode/char/2026/index.htm
> Encoding(ell_def) <- 'UTF-8'
> ell_def
[1] "…"
> Encoding(ell_def)
[1] "UTF-8"

# create the ellipsis from text (similar to your string)
> ell_text <- '…'
> ell_text
[1] "…"
> Encoding(ell_text)
[1] "latin1"

# show that you can get strange results
> gsub(ell_text,'',ell_def)
[1] "…"

Whether this example reproduces may depend on your locale. In my case, I work in windows-1252, since you cannot set the locale to UTF-8 in Windows. According to this stringi source, "R lets strings in ASCII, UTF-8, and your platform's native encoding coexist peacefully". As the example above shows, this can sometimes give contradictory results.

Basically, the two strings look the same when printed, but they are not the same at the byte level.
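
To see the byte-level difference directly, here is a minimal sketch. It assumes the typed ellipsis is stored as the single windows-1252 byte 0x85, as it would be in my locale; the variable names are only illustrative.

```r
# The ellipsis written as a Unicode escape is always marked as UTF-8
ell_utf8 <- "\u2026"
bytes_utf8 <- charToRaw(ell_utf8)

# Assumption: in windows-1252 the same character is the single byte 0x85
bytes_cp1252 <- as.raw(0x85)

print(Encoding(ell_utf8))   # encoding mark of the escaped string
print(bytes_utf8)           # three bytes: e2 80 a6
print(bytes_cp1252)         # one byte: 85

# Same glyph on screen, different bytes underneath
same_bytes <- identical(bytes_utf8, bytes_cp1252)
print(same_bytes)
```

Calling charToRaw() on tags_vector[5] and on the string you typed in should show the same kind of mismatch.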

If I run this example in the R terminal, I get similar results, but apparently, it shows the ellipsis as a dot: ".".

A quick fix for your example would be to use the ellipsis definition (ell_def) as the pattern in your gsub() call. E.g.:

gsub(ell_def,'',tags_vector[5])
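
A variant that avoids typing the character at all is to write the pattern as a Unicode escape and convert the input to UTF-8 first. This is only a sketch: the tags_vector below is a made-up stand-in, since the real data is not shown in the question.

```r
# Hypothetical stand-in for tags_vector (the real data is not shown)
tags_vector <- c("#a", "#rstats", "#b\u2026")

# "\u2026" yields a UTF-8 marked string no matter how the console
# encodes characters that are typed in directly
ellipsis <- "\u2026"

# enc2utf8() converts the input to UTF-8 so pattern and input agree;
# fixed = TRUE matches the pattern literally instead of as a regex
cleaned <- gsub(ellipsis, "", enc2utf8(tags_vector), fixed = TRUE)
print(cleaned)
```

Note that this removes every ellipsis in each string, not only a trailing one.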

Upvotes: 2
