Reputation: 371
If I have the following data frame:
> testdf
id test1 test2 test3 annot
1 A 10 0.96093147 0.9996099 A test 1 a
2 B 20 0.48922459 0.8225223 B-test-1-b
3 C 30 0.76671324 0.4031212 Ctest1c
4 D 40 0.92446136 0.7729842 ATEST
5 E 50 0.71542366 0.6379789 ATEST1
6 F 60 0.09695642 0.6293565 BTEST1
7 G 70 0.30056727 0.7030828 CTEST2
8 H 80 0.23391326 0.2112124 DTEST3
I would like to use dplyr, possibly together with another method such as a regex, to filter the "annot" column and retain only the rows where every letter present is upper-case (numbers can be ignored for filtering purposes), to obtain the following:
> testdf
id test1 test2 test3 annot
1 D 40 0.92446136 0.7729842 ATEST
2 E 50 0.71542366 0.6379789 ATEST1
3 F 60 0.09695642 0.6293565 BTEST1
4 G 70 0.30056727 0.7030828 CTEST2
5 H 80 0.23391326 0.2112124 DTEST3
I have tried several times but cannot get the combination right.
Here is the data for this example data frame:
> dput(testdf)
structure(list(id = c("A", "B", "C", "D", "E", "F", "G", "H"),
test1 = c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L), test2 = c(0.960931471,
0.489224595, 0.766713238, 0.924461365, 0.715423658, 0.096956416,
0.300567271, 0.23391326), test3 = c(0.999609908, 0.82252227,
0.403121205, 0.772984196, 0.637978895, 0.629356488, 0.703082753,
0.211212439), annot = c("A test 1 a", "B-test-1-b", "Ctest1c",
"ATEST", "ATEST1", "BTEST1", "CTEST2", "DTEST3")), class = "data.frame", row.names = c(NA,
-8L))
Upvotes: 0
Views: 1866
Reputation: 1364
Another solution, using base R:
testdf[!grepl("[a-z]", testdf$annot), ]
# id test1 test2 test3 annot
# 4 D 40 0.92446136 0.7729842 ATEST
# 5 E 50 0.71542366 0.6379789 ATEST1
# 6 F 60 0.09695642 0.6293565 BTEST1
# 7 G 70 0.30056727 0.7030828 CTEST2
# 8 H 80 0.23391326 0.2112124 DTEST3
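The same negated test can also go inside dplyr::filter, which is what the question asked for. Below is a minimal sketch, assuming dplyr is installed. The data frame name annot_df is made up for illustration and holds only the id and annot columns from the question. The class "[[:lower:]]" is swapped in for "[a-z]" because R's own regex documentation warns that character ranges are locale-dependent.

```r
library(dplyr)

# Hypothetical cut-down copy of the question's data: only id and annot.
annot_df <- data.frame(
  id = c("A", "B", "C", "D", "E", "F", "G", "H"),
  annot = c("A test 1 a", "B-test-1-b", "Ctest1c",
            "ATEST", "ATEST1", "BTEST1", "CTEST2", "DTEST3"),
  stringsAsFactors = FALSE
)

# Keep rows whose annot contains no lowercase letter.
upper_only <- annot_df %>% filter(!grepl("[[:lower:]]", annot))
upper_only$id
```

Note that a value with no letters at all (digits only, for example) would also pass this test, since it contains no lowercase letter.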
Upvotes: 0
Reputation: 173858
You can use base R's grepl as a logical test for strings containing a particular regex, and use that inside a dplyr::filter.
The regex for any uppercase letter is "[[:upper:]]":
testdf %>% filter(grepl("[[:upper:]]", annot))
#> id test1 test2 test3 annot
#> 1 D 40 0.92446136 0.7729842 ATEST
#> 2 E 50 0.71542366 0.6379789 ATEST1
#> 3 F 60 0.09695642 0.6293565 BTEST1
#> 4 G 70 0.30056727 0.7030828 CTEST2
#> 5 H 80 0.23391326 0.2112124 DTEST3
However, in your example, all the rows in testdf contain at least one uppercase letter (the first three cells have uppercase letters in the first position). So, to demonstrate, I have changed the first letter of each of the first three cells in annot to lowercase:
testdf <- structure(list(id = c("A", "B", "C", "D", "E", "F", "G", "H"),
test1 = c(10L, 20L, 30L, 40L, 50L, 60L, 70L, 80L), test2 = c(0.960931471,
0.489224595, 0.766713238, 0.924461365, 0.715423658, 0.096956416,
0.300567271, 0.23391326), test3 = c(0.999609908, 0.82252227,
0.403121205, 0.772984196, 0.637978895, 0.629356488, 0.703082753,
0.211212439), annot = c("a test 1 a", "b-test-1-b", "ctest1c",
"ATEST", "ATEST1", "BTEST1", "CTEST2", "DTEST3")),
class = "data.frame", row.names = c(NA, -8L))
testdf
#> id test1 test2 test3 annot
#> 1 A 10 0.96093147 0.9996099 a test 1 a
#> 2 B 20 0.48922459 0.8225223 b-test-1-b
#> 3 C 30 0.76671324 0.4031212 ctest1c
#> 4 D 40 0.92446136 0.7729842 ATEST
#> 5 E 50 0.71542366 0.6379789 ATEST1
#> 6 F 60 0.09695642 0.6293565 BTEST1
#> 7 G 70 0.30056727 0.7030828 CTEST2
#> 8 H 80 0.23391326 0.2112124 DTEST3
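To see why the modification was needed, and how the test might be tightened for the original data, here is a small base R sketch. It uses only the annot values from the question. The vector names has_upper and has_lower are made up for illustration.

```r
# annot values exactly as in the question (capitalised first letters).
annot <- c("A test 1 a", "B-test-1-b", "Ctest1c",
           "ATEST", "ATEST1", "BTEST1", "CTEST2", "DTEST3")

has_upper <- grepl("[[:upper:]]", annot)  # at least one uppercase letter
has_lower <- grepl("[[:lower:]]", annot)  # at least one lowercase letter

# Every original value contains an uppercase letter, so this test alone drops nothing.
all(has_upper)

# Also requiring "no lowercase letter" leaves only the fully upper-case values.
annot[has_upper & !has_lower]
```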
Upvotes: 1
Reputation: 1928
This dplyr answer should suffice: filter annot for values that contain uppercase letters but no lowercase letters. Numbers are ignored, but annot values that consist only of numbers would also be excluded.
library(dplyr)
library(stringr)  # provides str_detect()
new_df <- testdf %>%
  filter(str_detect(annot, '[:upper:]') & !str_detect(annot, '[:lower:]'))
id test1 test2 test3 annot
1 D 40 0.92446136 0.7729842 ATEST
2 E 50 0.71542366 0.6379789 ATEST1
3 F 60 0.09695642 0.6293565 BTEST1
4 G 70 0.30056727 0.7030828 CTEST2
5 H 80 0.23391326 0.2112124 DTEST3
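The remark about all-number values can be checked with a short sketch, assuming stringr is installed. The vector x below is hypothetical and not part of the question's data; it adds a digits-only value to show how the two conditions differ.

```r
library(stringr)

# Hypothetical values, not from the question: "12345" has no letters at all.
x <- c("ATEST1", "Ctest1c", "12345")

# No lowercase letter present: true for "ATEST1" and for "12345".
no_lower <- !str_detect(x, "[[:lower:]]")

# Additionally require at least one uppercase letter: drops "12345".
strict <- str_detect(x, "[[:upper:]]") & no_lower

x[no_lower]
x[strict]
```

Which variant fits depends on whether digits-only annotations should count as "upper-case"; the question does not say.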
Upvotes: 1