OganM

Reputation: 2663

Convert UTF-8 encoding in text form to characters

I have a character string that contains UTF-8 encoded bytes written out as plain-text hex. Example:

utf8 = "#C2#BD"

I am trying to get the character for this value. In this case it would be "½".

If this were encoded using UTF-16, it would be "00BD" (which is also the Unicode code point, U+00BD), and I could convert that into a character that is actually encoded in UTF-8 by doing

intToUtf8(strtoi('0x00BD'))
[1] "½"

However, I cannot find a way to get that integer value from the UTF-8 encoded hex "#C2#BD".

Ultimately I want to get from "#C2#BD" to ½. I suspect the path goes through the UTF-16 form, which strtoi can convert into an integer, but I am having a hard time understanding the relationship between the two.

Upvotes: 0

Views: 641

Answers (1)

user2554330

Reputation: 44788

This will do it for that example:

utf8chars <- strsplit(utf8, "#")

# just grab the first entry, and leave off the blank
utf8chars <- utf8chars[[1]][-1]

# Convert the hex to integer
utf8int <- strtoi(paste0("0x",utf8chars))

# Then to raw
utf8raw <- as.raw(utf8int)

# And finally to character
utf8char <- rawToChar(utf8raw)

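# On Windows you'll also need this, to mark the bytes as UTF-8
# (the value must be spelled "UTF-8"; other spellings are treated as unknown)
Encoding(utf8char) <- "UTF-8"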
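To connect this back to the "00BD" value in the question, here is a small sketch of how the two forms relate. The bytes C2 BD are the UTF-8 encoding of the code point U+00BD, and base R's utf8ToInt() and intToUtf8() convert between them. The variable name codepoint is only for illustration.

```r
# The two bytes parsed from "#C2#BD"
utf8raw <- as.raw(c(0xC2, 0xBD))
utf8char <- rawToChar(utf8raw)
Encoding(utf8char) <- "UTF-8"

# Decode the UTF-8 bytes into the Unicode code point
codepoint <- utf8ToInt(utf8char)
codepoint == strtoi("0x00BD")      # both are 189

# Reverse direction: code point back to its UTF-8 bytes
charToRaw(intToUtf8(codepoint))    # c2 bd
```

So there is no need to go through the code point at all: the hex pairs are already the bytes of the string, and they only have to be turned into raw values.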

Real data should need only small changes, such as applying these steps to each string in a vector.
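As a rough sketch of what that might look like for several strings at once, the steps can be wrapped in a function. The name decode_hex_utf8 is hypothetical, and the code assumes every byte is written as "#" followed by two hex digits, as in the question.

```r
# Hypothetical helper: decode strings like "#C2#BD" into UTF-8 text.
# Assumes each byte appears as "#" followed by two hex digits.
decode_hex_utf8 <- function(x) {
  vapply(x, function(s) {
    hex <- strsplit(s, "#", fixed = TRUE)[[1]]
    hex <- hex[nzchar(hex)]                      # drop the empty leading piece
    out <- rawToChar(as.raw(strtoi(hex, base = 16L)))
    Encoding(out) <- "UTF-8"                     # mark the bytes as UTF-8
    out
  }, character(1), USE.NAMES = FALSE)
}

decoded <- decode_hex_utf8(c("#C2#BD", "#E2#82#AC"))
utf8ToInt(decoded[1])   # 189  (U+00BD, "½")
utf8ToInt(decoded[2])   # 8364 (U+20AC, "€")
```

Using strtoi(hex, base = 16L) avoids having to paste a "0x" prefix onto each piece. Input that does not follow the "#XX" pattern is not validated here.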

Upvotes: 1
