anon01

Reputation: 23

Detecting accents in R

Is there a way, using grepl or another function, to detect all words that contain an accent? I don't want to ignore accents (that has been asked many times), just to detect every word that has any accented character in it.

Thanks

Upvotes: 2

Views: 471

Answers (3)

jpsmith

Reputation: 17450

In base R you could try:

Data:

txt <- c("aaaaaaaaaä", "cccccccç", "ccccccč", "abc", "nnnnnñ")
# fourth position doesn't have any accent

Find positions in vector:

grep("[\x7f-\xff]", txt)
# [1] 1 2 3 5

Or as a logical vector (TRUE/FALSE):

grepl("[\x7f-\xff]", txt)
# [1]  TRUE  TRUE  TRUE FALSE  TRUE

And to subset data:

# Only with accents
txt[grepl("[\x7f-\xff]", txt)]
# [1] "aaaaaaaaaä" "cccccccç"   "ccccccč"    "nnnnnñ"  

# Only without accents
txt[!grepl("[\x7f-\xff]", txt)]
# [1] "abc"

# could also use `grep()` instead of `grepl()` here
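Since the question asks about words rather than whole strings, the same idea can be applied after splitting a sentence. This is a minimal sketch, assuming words are separated by whitespace and punctuation does not matter; `sentence`, `has_non_ascii` and `accented_words` are hypothetical names, not part of the question. Like the patterns above, it flags any non-ASCII character, not only accented letters:

```r
# Hypothetical example sentence (not from the question)
sentence <- "The niño ate jamón at the café today"

# Split on whitespace (assumption: no punctuation handling)
words <- strsplit(sentence, "\\s+")[[1]]

# TRUE if any code point in the word lies above the ASCII range (> 127)
has_non_ascii <- vapply(
  words,
  function(w) any(utf8ToInt(enc2utf8(w)) > 127L),
  logical(1),
  USE.NAMES = FALSE
)

# Keep only the words that contain a non-ASCII character
accented_words <- words[has_non_ascii]
accented_words
```

Here `accented_words` holds "niño", "jamón" and "café". Working on code points with `utf8ToInt()` sidesteps the question of how a byte-range regex behaves in a given locale.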

Upvotes: 2

Chris Ruehlemann

Reputation: 21432

Another solution - detect non-ASCII characters:

library(stringr)
str_detect(txt, "[^ -~]")
[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

where [^ -~] is a negated character class that matches any character outside the printable ASCII range (without the negation, [ -~] matches any printable ASCII character, from space through tilde).

Or, using dplyr syntax:

library(dplyr)
library(stringr)
data.frame(txt) %>%
  filter(str_detect(txt, "[^ -~]"))
         txt
1 aaaaaaaaaä
2   cccccccç
3    ccccccč
4     nnnnnñ
5       ynàn

Data:

txt <- c("aaaaaaaaaä", "cccccccç", "ccccccč", "abc", "nnnnnñ", "xXXXz", "ynàn")
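To see which characters triggered the match, the same pattern can be passed to `str_extract_all()`. This is only a sketch built on the data above; `offending` is a hypothetical name:

```r
library(stringr)

txt <- c("aaaaaaaaaä", "cccccccç", "ccccccč", "abc", "nnnnnñ", "xXXXz", "ynàn")

# For each string, collect the characters outside the printable ASCII range
offending <- str_extract_all(txt, "[^ -~]")

# Name the list elements by their source string for easier reading
names(offending) <- txt
offending
```

Strings without any match, such as "abc", come back as `character(0)`, while "ynàn" yields "à".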

Upvotes: 2

user2554330

Reputation: 44907

You can use tools::showNonASCII to detect characters that aren't in the ASCII set. That includes accented characters as well as some symbols and characters from other alphabets:

x <- c("aaaaaaaaaä", 
       "cccccccç", 
       "ccccccč", 
       "abc", 
       "€", 
       "$")

tools::showNonASCII(x)
#> 1: aaaaaaaaa<c3><a4>
#> 2: ccccccc<c3><a7>
#> 3: cccccc<c4><8d>
#> 5: <e2><82><ac>

Created on 2022-10-12 with reprex v2.0.2
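If symbols such as "€" should not count as accents, one possible refinement is to decompose each string to Unicode NFD form and look for combining marks. This is a sketch under the assumption that the stringi package is installed; it is not part of the approach above, and `has_accent` is a hypothetical name. Letters that have no canonical decomposition (for example "ø") would not be detected this way:

```r
library(stringi)  # assumption: the stringi package is installed

x <- c("aaaaaaaaaä", "cccccccç", "ccccccč", "abc", "€", "$")

# NFD splits an accented letter into base letter + combining mark;
# \p{Mn} then matches any nonspacing (combining) mark
has_accent <- stri_detect_regex(stri_trans_nfd(x), "\\p{Mn}")
has_accent
#> [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

Unlike `tools::showNonASCII()`, this leaves "€" unflagged because the euro sign is a currency symbol with no combining mark.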

Upvotes: 2
