Vitalijs

Reputation: 950

Converting ASCII to UTF-8 with stringi in R

I have the following problem:

library(stringi)
x_1<-"P N001361/01"
x_2<-"Р N001361/01"
x_1==x_2
[1] FALSE

> stri_enc_mark(x_1)
[1] "ASCII"
> stri_enc_mark(x_2)
[1] "UTF-8"

Then I try:

stri_encode(x_1,"ASCII","UTF-8",to_raw=FALSE)==x_2

But the comparison still returns FALSE. Can somebody suggest how to make those two strings identical? (I am trying to merge x_1 by x_2.)

Upvotes: 1

Views: 2889

Answers (1)

amatsuo_net

Reputation: 2448

The problem is not about encoding conversion. The first character of x_2 is not the Latin letter "P" but the Cyrillic capital letter Er (U+0420), which only looks like it: https://unicode-table.com/en/0420/.

That becomes clear when you try to convert x_2 to ASCII:

> stri_encode(x_2,"UTF-8", "ASCII",to_raw=FALSE)
[1] "\032 N001361/01"
Warning message:
In stri_encode(x_2, "UTF-8", "ASCII", to_raw = FALSE) :
  the Unicode codepoint \U00000420 cannot be converted to destination encoding
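
Another way to spot such a look-alike character is to inspect the code points directly. The following is a minimal sketch, assuming the same two strings as in the question; x_2 is written here with a `\u0420` escape so that the example does not depend on the encoding of the source file.

```r
library(stringi)

x_1 <- "P N001361/01"        # starts with Latin capital P (U+0050)
x_2 <- "\u0420 N001361/01"   # starts with Cyrillic capital Er (U+0420)

# Code point of the first character of each string
utf8ToInt(substr(x_1, 1, 1))   # 80, i.e. 0x50
utf8ToInt(substr(x_2, 1, 1))   # 1056, i.e. 0x420

# Print the string with non-ASCII characters shown as \u escapes
stri_escape_unicode(x_2)

# Flag which strings contain only ASCII characters
stri_enc_isascii(c(x_1, x_2))  # TRUE FALSE
```

Running `stri_enc_isascii()` over the merge key column is a quick way to find rows that may contain such characters before merging.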

Therefore, you need to explicitly replace that character with the actual Latin letter "P":

# Replace Cyrillic Er (U+0420) with Latin "P"; a fixed pattern is enough, no parse() needed
x_2_rep <- stri_replace_all_fixed(x_2, "\u0420", "P")
x_1 == x_2_rep
## TRUE
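
If the data may contain other Cyrillic capitals that look like Latin ones, the single replacement above can be generalized. The sketch below is only an illustration: `normalize_lookalikes` is a hypothetical helper name, and the mapping table is an assumption that covers eleven common look-alike capitals, not a complete list. It uses `stri_trans_char()` from stringi, which translates characters one-to-one in the way `chartr()` does.

```r
library(stringi)

# Cyrillic capitals A, Ve, Ie, Ka, Em, En, O, Er, Es, Te, Ha ...
cyr <- "\u0410\u0412\u0415\u041A\u041C\u041D\u041E\u0420\u0421\u0422\u0425"
# ... and the Latin capitals they resemble, in the same order
lat <- "ABEKMHOPCTX"

# Hypothetical helper: map each look-alike Cyrillic capital to its Latin twin
normalize_lookalikes <- function(x) stri_trans_char(x, cyr, lat)

x_1 <- "P N001361/01"        # Latin P
x_2 <- "\u0420 N001361/01"   # Cyrillic Er

x_2_norm <- normalize_lookalikes(x_2)
x_1 == x_2_norm
```

Applying the helper to the key column on both sides before merging should make the keys comparable. Note that it would also alter genuine Cyrillic text, so it is only suitable for keys that are expected to be Latin.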

Upvotes: 2
