Reputation: 335
I am trying to identify unique Unicode values in a data frame composed of character strings. I have tried using the grep function, but I get the following error:
Error: '\U' used without hex digits in character string starting ""\U"
An example data frame:
                     time sender message
1     2012-12-04 13:40:00      1 Hello handsome!
2     2012-12-04 13:40:08      1 \U0001f618
3     2012-12-04 14:39:24      1 \U0001f603
4     2012-12-04 16:04:25      2 <image omitted>
73    2012-12-05 06:02:17      1 Haha not white and blue... White with blue eyes \U0001f61c
40619 2015-05-08 10:00:58      1 \U0001f631\U0001f637
grep("\U", dat$messages)
data
dat <-
structure(list(time = c("2012-12-04 13:40:00", "2012-12-04 13:40:08",
"2012-12-04 14:39:24", "2012-12-04 16:04:25", "2012-12-05 06:02:17",
"2015-05-08 10:00:58"), sender = c(1L, 1L, 1L, 2L, 1L, 1L), message = c("Hello handsome!",
"\U0001f618", "\U0001f603", "<image omitted>", "Haha not white and blue... White with blue eyes \U0001f61c",
"\U0001f631\U0001f637")), .Names = c("time", "sender", "message"
), class = "data.frame", row.names = c("1", "2", "3", "4", "73",
"40619"))
Upvotes: 5
Views: 4103
Reputation: 21621
Try:
library(stringi)
stri_enc_isascii(dat$message)
Which gives:
# [1] TRUE FALSE FALSE TRUE FALSE FALSE
Upvotes: 5
Reputation: 206197
I'm assuming by "unicode character" you just mean non-ASCII characters. Character codes can mean different things depending on the encoding. R represents values outside of the current encoding with a special \U escape sequence. Note that neither the backslash nor the letter "U" actually appears in the real data. This is just how the characters are escaped for printing on screen when the appropriate glyph isn't available.
For example, even though the last message looks long when printed, it is actually only two characters long:
dat$message[6]
# [1] "\U0001f631\U0001f637"
nchar(dat$message[6])
# [1] 2
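To make the point about escaping concrete, here is a small sketch using only base R. The variable name `x` is just for illustration. It checks that the string holds two characters stored as eight UTF-8 bytes, and that neither a literal backslash nor a literal "U" is present in the data.

```r
# The last message from the question: two emoji
x <- "\U0001f631\U0001f637"

n_chars <- nchar(x, type = "chars")   # number of characters: 2
n_bytes <- nchar(x, type = "bytes")   # UTF-8 bytes: 8 (4 per emoji)

# Fixed-string searches for the literal escape characters find nothing
has_backslash <- grepl("\\", x, fixed = TRUE)  # FALSE
has_U         <- grepl("U", x, fixed = TRUE)   # FALSE
```

This is also why searching for "\U" as text cannot work: the escape exists only in the printed representation.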
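You can find non-ASCII characters using regular expressions pretty easily. ASCII characters all have codes 0-127 (or 000 to 177 in octal). You can find values outside that range with the pattern below; the range starts at \001 because an R string cannot contain the NUL character: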
grep("[^\001-\177]", dat$message)
# [1] 2 3 5 6
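If the goal is the set of distinct non-ASCII values rather than the rows that contain them, one possible approach is to convert each string to its integer code points with base R's `utf8ToInt` and keep the unique values above the ASCII range (greater than 127). This is only a sketch: the names `msgs`, `code_points`, `non_ascii` and `labels` are illustrative, and it assumes the strings are valid UTF-8.

```r
# Messages copied from the question's example data
msgs <- c("Hello handsome!", "\U0001f618", "\U0001f603", "<image omitted>",
          "Haha not white and blue... White with blue eyes \U0001f61c",
          "\U0001f631\U0001f637")

# utf8ToInt() accepts one string at a time, so apply it per message
code_points <- unlist(lapply(msgs, utf8ToInt))

# Distinct code points outside the ASCII range, in ascending order
non_ascii <- sort(unique(code_points[code_points > 127]))

# Format as U+XXXX for readability
labels <- sprintf("U+%04X", non_ascii)
labels
# [1] "U+1F603" "U+1F618" "U+1F61C" "U+1F631" "U+1F637"
```

With the question's data frame, `msgs` would simply be `dat$message`.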
Upvotes: 10