Reputation: 970
What is the preferred way of determining if a string contains non-Roman/non-English (e.g., ないでさ) characters?
Upvotes: 2
Views: 2164
Reputation: 263479
You could use regex/grep to check for hex values of characters outside the range of printable ASCII characters:
x <- 'ないでさ'
grep( "[^\x20-\x7F]",x )
#[1] 1
grep( "[^\x20-\x7F]","Normal text" )
#integer(0)
If you want the non-printing ("control") characters to count as "English" as well, you could extend the range of the character class in the first argument to grep so that it starts at "\x01". See ?regex for more information on using character class arguments, and ?Quotes for more information about how to specify characters as Unicode, hexadecimal, or octal values.
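As a rough sketch of the two character classes described above, the checks can be wrapped with grepl(), which returns one logical value per element instead of indices. The helper names has_non_printable_ascii and has_non_ascii are made up for illustration, and perl = TRUE is an assumption chosen so that the ranges are read as plain code-point ranges.

```r
# Hypothetical helpers, not part of the original answer.

# TRUE where a string has any character outside printable ASCII (space to ~).
has_non_printable_ascii <- function(x) {
  grepl("[^\x20-\x7E]", x, perl = TRUE)
}

# Same idea, but control characters (\x01-\x1F and \x7F) are accepted too.
has_non_ascii <- function(x) {
  grepl("[^\x01-\x7F]", x, perl = TRUE)
}

x <- c("ないでさ", "Normal text", "tab\there")
has_non_printable_ascii(x)  # the tab counts as "non-printable" here
has_non_ascii(x)            # the tab is accepted here
```

The first helper flags the tab in the third string; the second one does not, because it only objects to characters above \x7F.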
The R.oo package has conversion functions that may be useful:
library(R.oo)
?intToChar
?charToInt
The fact that Henrik Bengtsson saw fit to include these in his package suggests to me that there is no handy method to do this in base/default R. He's a long-time useR/guRu.
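For what it is worth, base R does ship utf8ToInt() and intToUtf8() for converting a single string to and from code points, so a code-point check can also be sketched without extra packages. The function name has_non_ascii_codepoint below is hypothetical, and the sketch assumes the input can be converted to UTF-8 with enc2utf8().

```r
# Sketch only: has_non_ascii_codepoint() is a hypothetical name.
# utf8ToInt() works on a single string, so vapply() applies it per element.
has_non_ascii_codepoint <- function(x) {
  vapply(
    enc2utf8(x),
    function(s) any(utf8ToInt(s) > 127L),  # ASCII code points are 0..127
    logical(1),
    USE.NAMES = FALSE
  )
}

utf8ToInt("Work")  # integer code points, all below 128
has_non_ascii_codepoint(c("ないでさ", "Work", "für"))
```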
Seeing the other answer prompted this attempt, which seems straightforward:
> is.na( iconv( c(x, "OrdinaryASCII") , "", "ASCII") )
[1] TRUE FALSE
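The one-liner above could be wrapped as follows. This is a minimal sketch: the name is_non_ascii is hypothetical, the input is assumed to be UTF-8 (hence from = "UTF-8" rather than the native encoding ""), and the result relies on iconv() returning NA for elements it cannot convert, which depends on the platform's iconv implementation (see ?iconv).

```r
# Hypothetical wrapper around the iconv() check.
is_non_ascii <- function(x) {
  converted <- iconv(x, from = "UTF-8", to = "ASCII")
  # A missing input also converts to NA, so exclude it explicitly.
  !is.na(x) & is.na(converted)
}

is_non_ascii(c("ないでさ", "OrdinaryASCII", NA))
```

Without the !is.na(x) guard, a genuinely missing value would be reported as containing non-ASCII characters.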
Upvotes: 7
Reputation: 2884
You could determine whether a string contains non-Latin/non-ASCII characters with iconv and grep:
# My example, because you didn't add your data
characters <- c("ないでさ, satisfação, катынь, Work, Awareness, Potential, für")
# First you convert string to vector of words
characters.unlist <- unlist(strsplit(characters, split=", "))
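# Then find indices of words with non-ASCII characters: iconv replaces every
# byte it cannot convert with the marker text "characters.unlist", and grep
# then finds the words that contain that marker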
characters.non.ASCII <- grep("characters.unlist", iconv(characters.unlist, "latin1", "ASCII", sub="characters.unlist"))
# subset original vector of words to exclude words with non-ASCII characters
data <- characters.unlist[-characters.non.ASCII]
# convert vector back to a string
dat.1 <- paste(data, collapse = ", ")
# Now if you run
characters.non.ASCII
[1] 1 2 3 7
That means that the first, second, third and seventh words contain non-ASCII characters; in my case indices 1, 2, 3 and 7 correspond to "ないでさ", "satisfação", "катынь" and "für".
You could also run
dat.1 # and the output will be all ASCII characters
[1] "Work, Awareness, Potential"
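A possible variation on the steps above, sketched with a logical mask instead of negative indices. The variable names words, is_ascii and ascii_only are made up for illustration, and the sketch swaps the iconv substitution trick for a grepl() character class, so it is not the same technique as the code above.

```r
# Hypothetical variable names; the input mirrors the example above.
characters <- "ないでさ, satisfação, катынь, Work, Awareness, Potential, für"
words <- strsplit(characters, split = ", ", fixed = TRUE)[[1]]

# TRUE for words made up only of ASCII characters (\x01-\x7F).
is_ascii <- !grepl("[^\x01-\x7F]", words, perl = TRUE)

# Logical indexing stays correct when every word is ASCII, whereas
# words[-integer(0)] would return an empty vector.
ascii_only <- paste(words[is_ascii], collapse = ", ")
ascii_only
```

The design point is the indexing: if no word contains non-ASCII characters, the negative-index version drops everything, while the logical mask keeps all words.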
Upvotes: 5