How to determine if character string contains non-Roman characters in R

Question

What is the preferred way of determining if a string contains non-Roman/non-English (e.g., ないでさ) characters?

Miha · Accepted Answer

You can determine whether a string contains non-ASCII characters (which includes every non-Latin script) with iconv and grep:

# My example, because you didn't add your data
characters <- c("ないでさ, satisfação, катынь, Work, Awareness, Potential, für")
# First, split the string into a vector of words
characters.unlist <- unlist(strsplit(characters, split=", "))
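# Then find the indices of words with non-ASCII characters: iconv() replaces every
# byte it cannot convert to ASCII with the marker "NON_ASCII", and grep() finds the marker
characters.non.ASCII <- grep("NON_ASCII", iconv(characters.unlist, "latin1", "ASCII", sub="NON_ASCII"), fixed=TRUE)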
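# Subset the original vector of words to exclude words with non-ASCII characters
# (setdiff() on the indices also works when no word was flagged, unlike a negative index)
data <- characters.unlist[setdiff(seq_along(characters.unlist), characters.non.ASCII)]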
# Convert the vector back to a single string
dat.1 <- paste(data, collapse = ", ")

# Now if you run 
characters.non.ASCII
[1] 1 2 3 7

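That means that the first, second, third and seventh words contain non-ASCII characters. In my example, indices 1, 2, 3 and 7 correspond to "ないでさ", "satisfação", "катынь" and "für". Note that "satisfação" and "für" are written in the Latin script; they are flagged because their accented letters fall outside ASCII.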

You could also run

dat.1 # the output contains only the all-ASCII words
[1] "Work, Awareness, Potential"
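Since the question asks about non-Roman characters rather than non-ASCII ones, here is a minimal sketch of both checks in base R. The helper names has_non_ascii and has_non_latin are my own (hypothetical), not from any package. The second helper assumes that your R build's PCRE supports Unicode script properties such as \p{Latin}, which is usually the case but is not guaranteed. The words are written with \u escapes so the example does not depend on the file encoding.

```r
# Hypothetical helpers for illustration; not part of the answer above.

# TRUE for each element that contains at least one character outside ASCII
# (code point above 127).
has_non_ascii <- function(x) {
  x <- enc2utf8(x)
  vapply(x, function(s) any(utf8ToInt(s) > 127L), logical(1), USE.NAMES = FALSE)
}

# TRUE for each element that contains a letter from a script other than Latin.
# The class [^\p{Latin}\P{L}] matches a character that is a letter (\p{L})
# and is not in the Latin script. Assumes PCRE with Unicode property support.
has_non_latin <- function(x) {
  grepl("[^\\p{Latin}\\P{L}]", enc2utf8(x), perl = TRUE)
}

words <- c(
  "\u306a\u3044\u3067\u3055",              # the Japanese example word
  "satisfa\u00e7\u00e3o",                  # satisfacao with c-cedilla and a-tilde
  "\u043a\u0430\u0442\u044b\u043d\u044c",  # the Cyrillic example word
  "Work",
  "f\u00fcr"                               # fur with u-umlaut
)

has_non_ascii(words)
# [1]  TRUE  TRUE  TRUE FALSE  TRUE

has_non_latin(words)
```

On a build with Unicode property support, has_non_latin should flag only the Japanese and Cyrillic words, because the accented letters in "satisfação" and "für" still belong to the Latin script. Both helpers return a logical vector, so words[!has_non_latin(words)] keeps the Latin-script words without the empty-index pitfall of negative subsetting.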
