Brandon Loudermilk

Reputation: 970

How to determine if character string contains non-Roman characters in R

What is the preferred way of determining if a string contains non-Roman/non-English (e.g., ないでさ) characters?

Upvotes: 2

Views: 2164

Answers (2)

IRTFM

Reputation: 263479

You could use regex/grep to check for hex values of characters outside the range of printable ASCII characters:

x <- 'ないでさ'
grep( "[^\x20-\x7F]",x )
#[1] 1
grep( "[^\x20-\x7F]","Normal text" )
#integer(0)

If you wanted the non-printing ("control") characters to be considered "English", you could extend the range of the character class in the first argument to grep so that it starts with "\x01". See ?regex for more information on using character class arguments. See ?Quotes for more information about how to specify characters as Unicode, hexadecimal, or octal values.
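
As a minimal sketch of that variation, the two character classes can be wrapped in a small function. The name has_non_ascii and the allow_control argument are made up for illustration and are not part of base R; grepl() is used instead of grep() so that the result is one TRUE/FALSE per input string.

```r
# Hypothetical helper (the name and arguments are illustrative assumptions).
has_non_ascii <- function(strings, allow_control = FALSE) {
  # Negated character class: matches any character outside the allowed range.
  # "\x20-\x7F" is the range used in the answer; "\x01-\x7F" also admits
  # control characters such as tab and newline.
  pattern <- if (allow_control) "[^\x01-\x7F]" else "[^\x20-\x7F]"
  grepl(pattern, strings)
}

has_non_ascii(c("ないでさ", "Normal text", "tab\tseparated"))
# [1]  TRUE FALSE  TRUE
has_non_ascii(c("ないでさ", "Normal text", "tab\tseparated"), allow_control = TRUE)
# [1]  TRUE FALSE FALSE
```

With the narrower range the tab character is flagged as well, which is why the extended range matters for text that contains tabs or line breaks.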

The R.oo package has conversion functions that may be useful:

library(R.oo)
?intToChar
?charToInt

The fact that Henrik Bengtsson saw fit to include these in his package suggests to me that there is not a handy method to do this in base/default R. He's a long-time useR/guRu.

Seeing the other answer prompted this attempt, which seems straightforward:

> is.na( iconv( c(x, "OrdinaryASCII") , "", "ASCII") )
[1]  TRUE FALSE
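
The iconv check above can be sketched as a reusable function. The name contains_non_ascii is an assumption made for illustration, and the source encoding is passed explicitly as "UTF-8" rather than "" (the current locale), which assumes the input really is UTF-8. The result relies on iconv() returning NA for strings it cannot convert, which depends on the platform's iconv implementation.

```r
# Hypothetical helper (the name is an illustrative assumption).
contains_non_ascii <- function(strings, from = "UTF-8") {
  # iconv() returns NA for any element it cannot convert to ASCII,
  # so is.na() flags the strings with non-ASCII characters.
  # Note that NA inputs are also reported as TRUE.
  is.na(iconv(strings, from = from, to = "ASCII"))
}

words <- c("ないでさ", "satisfação", "Work", "für")
contains_non_ascii(words)
# [1]  TRUE  TRUE FALSE  TRUE

# Keep only the plain-ASCII strings
words[!contains_non_ascii(words)]
# [1] "Work"
```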

Upvotes: 7

Miha

Reputation: 2884

You could determine whether a string contains non-Latin/non-ASCII characters with iconv and grep:

# My example, because you didn't add your data
characters <- c("ないでさ,  satisfação, катынь, Work, Awareness, Potential, für")
# First you convert string to vector of words
characters.unlist <- unlist(strsplit(characters, split=", "))
# Then find the indices of words with non-ASCII characters: iconv() replaces every
# byte it cannot convert with a marker string, and grep() looks for that marker
characters.non.ASCII <- grep("NON_ASCII", iconv(characters.unlist, "latin1", "ASCII", sub="NON_ASCII"), fixed=TRUE)
# subset original vector of words to exclude words with non-ASCII characters
data <- characters.unlist[-characters.non.ASCII]
# convert vector back to a string
dat.1 <- paste(data, collapse = ", ")

# Now if you run 
characters.non.ASCII
[1] 1 2 3 7 

That means that the first, second, third and seventh words contain non-ASCII characters. In my case, indices 1, 2, 3 and 7 correspond to "ないでさ", "satisfação", "катынь" and "für".

You could also run

dat.1 # and the output will be all ASCII characters
[1] "Work, Awareness, Potential"
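
One caveat with the negative indexing used above: if no word contains non-ASCII characters, the index vector is empty, and subsetting with -integer(0) returns an empty vector instead of all the words. A sketch that avoids this with a logical mask follows. The function name keep_ascii_words and the marker string are made up for illustration, and the substitution behaviour of iconv() can vary with the platform's iconv implementation.

```r
# Hypothetical helper (name, arguments and marker are illustrative assumptions).
keep_ascii_words <- function(text, sep = ", ", marker = "<NON_ASCII>") {
  # Split the string into words and drop stray surrounding whitespace
  words <- trimws(unlist(strsplit(text, split = sep, fixed = TRUE)))
  # Replace every byte that cannot be represented in ASCII with the marker
  converted <- iconv(words, from = "latin1", to = "ASCII", sub = marker)
  # Logical mask: TRUE for words in which a substitution took place
  flagged <- grepl(marker, converted, fixed = TRUE)
  # A logical mask keeps every word when nothing is flagged
  paste(words[!flagged], collapse = sep)
}

keep_ascii_words("ないでさ,  satisfação, катынь, Work, Awareness, Potential, für")
# [1] "Work, Awareness, Potential"
keep_ascii_words("Work, Awareness")
# [1] "Work, Awareness"
```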

Upvotes: 5
