Deena

Reputation: 6223

Regular expression for non-English characters

I need to check if some strings contain any non-English characters.

x = c('Kält', 'normal', 'normal with, punctuation ~-+!', 'normal with number 1234')
grep(pattern = ??, x) # Expected output: 1

Upvotes: 4

Views: 1036

Answers (2)

jophuh

Reputation: 321

Expanding on the answer that's already been provided:

To check for non-ASCII characters:

x = c('Kält', 'normal', 'normal punctuation ~-+!', 'normal number 1234')
grep(pattern = "[^[:ascii:]]", x, perl=TRUE) 
grep(pattern = "[^[:ascii:]]", x, value=TRUE, perl=TRUE) 

To check for characters outside the ASCII range by Unicode code point (U+0001 to U+007F):

x = c('Kält', 'normal', 'normal punctuation ~-+!', 'normal number 1234')
grep(pattern = "[^\u0001-\u007F]+", x, perl=TRUE) 
grep(pattern = "[^\u0001-\u007F]+", x, value=TRUE, perl=TRUE) 
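To see why only the first string matches, it can help to look at the code points directly. The following is a minimal base R sketch, not part of the original answer; the helper name `non_ascii_codepoints` is hypothetical, and 'Kält' is written with a `\u` escape so the example does not depend on the file encoding.

```r
# "K\u00e4lt" is 'Kält' written with an escape for the a-umlaut.
x <- c("K\u00e4lt", "normal", "normal punctuation ~-+!", "normal number 1234")

# Hypothetical helper: return the code points above 127 (outside ASCII)
# found in a single string.
non_ascii_codepoints <- function(s) {
  cp <- utf8ToInt(enc2utf8(s))  # integer code point of each character
  cp[cp > 127L]
}

offending <- lapply(x, non_ascii_codepoints)

offending[[1]]         # 228, i.e. U+00E4
lengths(offending) > 0 # TRUE for strings containing non-ASCII characters
```

Only the first element has a non-empty result, which is consistent with the grep calls above returning index 1.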

You can also use the stringi package to test whether each string is entirely ASCII; negate the result to find the strings that contain non-ASCII characters:

x = c('Kält', 'normal', 'normal punctuation ~-+!', 'normal number 1234')
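stringi::stri_enc_isascii(x)          # TRUE for strings that are entirely ASCII
which(!stringi::stri_enc_isascii(x))  # indices of strings with non-ASCII characters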

Upvotes: 1

Wiktor Stribiżew

Reputation: 627536

You may use the PCRE pattern [^[:ascii:]], which matches any character outside the ASCII range (note that perl=TRUE is required):

x = c('Kält', 'normal', 'normal with, punctuation ~-+!', 'normal with number 1234')
grep(pattern = "[^[:ascii:]]", x, perl=TRUE) 
grep(pattern = "[^[:ascii:]]", x, value=TRUE, perl=TRUE) 

Output:

[1] 1
[1] "Kält"

See the R demo
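If a logical flag per string is more convenient than indices, the same pattern can be wrapped with grepl. This is only a sketch under the same assumptions as the answer; the helper name `has_non_ascii` is hypothetical and not part of base R.

```r
# Hypothetical helper: TRUE for each string that contains at least one
# character outside the ASCII range. Uses the PCRE class from the answer.
has_non_ascii <- function(strings) {
  grepl("[^[:ascii:]]", strings, perl = TRUE)
}

# "K\u00e4lt" is 'Kält' written with an escape so the file encoding does not matter.
x <- c("K\u00e4lt", "normal", "normal with, punctuation ~-+!", "normal with number 1234")

flags <- has_non_ascii(x)
which(flags)  # positions of strings with non-ASCII characters: 1
x[!flags]     # the strings that are pure ASCII
```

The logical vector is handy for filtering, for example keeping only the ASCII-only strings with `x[!flags]`.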

Upvotes: 5
