Reputation: 371

dplyr filter for cells that contain only uppercase characters

If I have the following data frame:

> testdf
  id test1      test2     test3      annot
1  A    10 0.96093147 0.9996099 A test 1 a
2  B    20 0.48922459 0.8225223 B-test-1-b
3  C    30 0.76671324 0.4031212    Ctest1c
4  D    40 0.92446136 0.7729842      ATEST
5  E    50 0.71542366 0.6379789     ATEST1
6  F    60 0.09695642 0.6293565     BTEST1
7  G    70 0.30056727 0.7030828     CTEST2
8  H    80 0.23391326 0.2112124     DTEST3

I would like to use dplyr, possibly together with another method such as a regex, to filter the "annot" column and keep only the rows in which every letter is upper-case (digits can be ignored for filtering purposes), to obtain the following:

> testdf
  id test1      test2     test3  annot
1  D    40 0.92446136 0.7729842  ATEST
2  E    50 0.71542366 0.6379789 ATEST1
3  F    60 0.09695642 0.6293565 BTEST1
4  G    70 0.30056727 0.7030828 CTEST2
5  H    80 0.23391326 0.2112124 DTEST3

I have tried several times but cannot get the combination right.

Here is the data for this example data frame:

> dput(testdf)
structure(list(id = c("A", "B", "C", "D", "E", "F", "G", "H"), 
    test1 = c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L), test2 = c(0.960931471, 
    0.489224595, 0.766713238, 0.924461365, 0.715423658, 0.096956416, 
    0.300567271, 0.23391326), test3 = c(0.999609908, 0.82252227, 
    0.403121205, 0.772984196, 0.637978895, 0.629356488, 0.703082753, 
    0.211212439), annot = c("A test 1 a", "B-test-1-b", "Ctest1c", 
    "ATEST", "ATEST1", "BTEST1", "CTEST2", "DTEST3")), class = "data.frame", row.names = c(NA, 
-8L))

Upvotes: 0

Answers (3)

Mike V

Reputation: 1364

Another solution using base R

# Drop every row whose annot contains a lowercase letter
testdf[!grepl("[a-z]", testdf$annot), ]
#   id test1      test2     test3  annot
# 4  D    40 0.92446136 0.7729842  ATEST
# 5  E    50 0.71542366 0.6379789 ATEST1
# 6  F    60 0.09695642 0.6293565 BTEST1
# 7  G    70 0.30056727 0.7030828 CTEST2
# 8  H    80 0.23391326 0.2112124 DTEST3
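
Since the question asks for dplyr, the same negated grepl test can presumably be placed inside filter(). The following is a minimal sketch, not a definitive version: it uses a cut-down copy of the question's data (keeping only id and annot is my simplification) and the name result is hypothetical.

```r
library(dplyr)

# Cut-down copy of the question's data frame (id and annot only)
testdf <- data.frame(
  id    = c("A", "B", "C", "D", "E", "F", "G", "H"),
  annot = c("A test 1 a", "B-test-1-b", "Ctest1c",
            "ATEST", "ATEST1", "BTEST1", "CTEST2", "DTEST3")
)

# Keep rows whose annot contains no ASCII lowercase letter
result <- testdf %>% filter(!grepl("[a-z]", annot))
result$id
```

This should keep rows D to H. The R regex documentation advises against character ranges such as [a-z] because their interpretation can depend on the locale, so "[[:lower:]]" may be the safer pattern. Note also that a value with no letters at all would pass this test.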

Upvotes: 0

Allan Cameron

Reputation: 173858

You can use base R's grepl as a logical test for whether each string matches a regex, and use that test inside dplyr::filter.

The regex for any uppercase letter is "[[:upper:]]".

library(dplyr)

testdf %>% filter(grepl("[[:upper:]]", annot))
#>   id test1      test2     test3  annot
#> 1  D    40 0.92446136 0.7729842  ATEST
#> 2  E    50 0.71542366 0.6379789 ATEST1
#> 3  F    60 0.09695642 0.6293565 BTEST1
#> 4  G    70 0.30056727 0.7030828 CTEST2
#> 5  H    80 0.23391326 0.2112124 DTEST3

However, in your example, all the rows in testdf contain at least one uppercase letter (the first three cells have uppercase letters in the first position), so this filter would keep every row. To demonstrate, the output above uses a version of testdf in which I have changed the first letter of each of the first three cells in annot to lowercase:

testdf <- structure(list(id = c("A", "B", "C", "D", "E", "F", "G", "H"), 
    test1 = c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L), test2 = c(0.960931471, 
    0.489224595, 0.766713238, 0.924461365, 0.715423658, 0.096956416, 
    0.300567271, 0.23391326), test3 = c(0.999609908, 0.82252227, 
    0.403121205, 0.772984196, 0.637978895, 0.629356488, 0.703082753, 
    0.211212439), annot = c("a test 1 a", "b-test-1-b", "ctest1c", 
    "ATEST", "ATEST1", "BTEST1", "CTEST2", "DTEST3")), 
    class = "data.frame", row.names = c(NA, -8L))

testdf
#>   id test1      test2     test3      annot
#> 1  A    10 0.96093147 0.9996099 a test 1 a
#> 2  B    20 0.48922459 0.8225223 b-test-1-b
#> 3  C    30 0.76671324 0.4031212    ctest1c
#> 4  D    40 0.92446136 0.7729842      ATEST
#> 5  E    50 0.71542366 0.6379789     ATEST1
#> 6  F    60 0.09695642 0.6293565     BTEST1
#> 7  G    70 0.30056727 0.7030828     CTEST2
#> 8  H    80 0.23391326 0.2112124     DTEST3
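
To see why the "contains an uppercase letter" test alone does not separate the rows of the question's original data, it may help to look at the two logical vectors side by side. This is a base R sketch; the names has_upper, has_lower and kept are mine, and the annot vector is copied from the question's dput output.

```r
# annot values exactly as in the question's original data
annot <- c("A test 1 a", "B-test-1-b", "Ctest1c",
           "ATEST", "ATEST1", "BTEST1", "CTEST2", "DTEST3")

has_upper <- grepl("[[:upper:]]", annot)  # TRUE for every element here
has_lower <- grepl("[[:lower:]]", annot)  # TRUE only for the first three

# Require at least one uppercase letter and no lowercase letters
kept <- annot[has_upper & !has_lower]
kept
```

With the original data, has_upper is TRUE everywhere, so it is the !has_lower condition that does the filtering. The same combined condition could be passed to dplyr::filter.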

Upvotes: 1

TTS

Reputation: 1928

This dplyr answer should suffice: keep the rows whose annot value contains at least one uppercase letter and no lowercase letters. Digits are ignored, but note that a value consisting only of digits is excluded, because it contains no uppercase letter.

library(dplyr)
library(stringr)

new_df <- testdf %>%
  filter(str_detect(annot, '[:upper:]') & !str_detect(annot, '[:lower:]'))

  id test1      test2     test3  annot
1  D    40 0.92446136 0.7729842  ATEST
2  E    50 0.71542366 0.6379789 ATEST1
3  F    60 0.09695642 0.6293565 BTEST1
4  G    70 0.30056727 0.7030828 CTEST2
5  H    80 0.23391326 0.2112124 DTEST3
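
The all-digits edge case can be illustrated with a small made-up data frame. This is only a sketch: the edge data frame and the names strict and loose are hypothetical, and the patterns are written in the bracketed "[[:upper:]]" form.

```r
library(dplyr)
library(stringr)

# Hypothetical values chosen to exercise the edge cases
edge <- data.frame(annot = c("ATEST1", "12345", "Ctest1c", "A-B 2"))

# At least one uppercase letter AND no lowercase letters
strict <- edge %>%
  filter(str_detect(annot, "[[:upper:]]") & !str_detect(annot, "[[:lower:]]"))

# No lowercase letters only; digit-only values survive this test
loose <- edge %>%
  filter(!str_detect(annot, "[[:lower:]]"))

strict$annot
loose$annot
```

Here strict should drop "12345" while loose keeps it, so the choice between the two depends on whether values without any letters should count as "only uppercase".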

Upvotes: 1
