Reputation: 4839
I am using R on Ubuntu and trying to go over a list of files; some of them I need and some I don't.
I try to pick out the ones I need by finding a substring in them that must appear exactly once.
I am using the function grep, which I found here: grep function in r
and the regex rules that I found here: regex rules
When I take this simple example:
a <- c("a","aa")
grep("a{1}", a)
I would expect to get only the strings that contain "a" exactly once, but instead I get both of them.
When I use 2 instead of 1, I do get the wanted result of one string (the one that contains "aa").
I can't use $ because the substring is not at the end of the words I need. For example, given the two words "germ-pass.tab" and "germ-pass_germ-pass.tab", I need to return only the first, which contains "germ-pass" once and only once.
I can't use ^a because I don't need words such as "aca".
Thanks.
Upvotes: 2
Views: 5291
Reputation: 39657
In base R you can find the strings that contain a substring exactly once by removing the substring with gsub
and testing whether the string gets shorter by exactly the length of the searched substring:
s <- c("a", "aa", "aca", "", "b", "ba", "ab", "cac", "abab", "ab-ab", NA)
ss <- "a" #Substring to find exactly once
nchar(s) - nchar(gsub(ss, "", s)) == nchar(ss)
#[1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE NA
or you count the hits of gregexpr
sapply(gregexpr(ss, s), function(x) sum(x>0)) == 1
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE NA
or as @sebastian-c already mentioned
lengths(regmatches(s, gregexpr(ss, s))) == 1
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
or with two grepl calls,
one asking whether the substring is present at least once, the other whether it is there at least twice:
!grepl("(.*a){2}", s) & grepl("a", s)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
or the same expressed in one regex, where (?!(.*a){2}) is a non-consuming (zero-width) negative lookahead:
grepl("^(?!(.*a){2}).*a.*$", s, perl=TRUE)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
or, more generally, in case you want to change it to find the substring exactly n times:
!grepl("(.*a){2}", s) & grepl("(.*a){1}", s)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
grepl("^(?!(.*a){2})(.*a){1}.*$", s, perl=TRUE)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
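The "exactly n times" idea can also be sketched without building a regex, by counting literal hits. This is only a minimal sketch: count_fixed and has_exactly are hypothetical helper names, not part of base R, and fixed = TRUE is an assumption that the substring should be taken literally rather than as a regex.

```r
# Hypothetical helper: count non-overlapping literal occurrences of `sub`.
count_fixed <- function(x, sub) {
  hits <- gregexpr(sub, x, fixed = TRUE)
  # gregexpr() reports -1 when there is no match, so count positive positions only
  vapply(hits, function(pos) sum(pos > 0), integer(1))
}

# Hypothetical helper: TRUE where `sub` occurs exactly n times.
has_exactly <- function(x, sub, n = 1L) count_fixed(x, sub) == n

s <- c("a", "aa", "aca", "", "b", "ba", "ab", "cac", "abab", "ab-ab")
has_exactly(s, "a")           # TRUE only for "a", "ba", "ab", "cac"
has_exactly(s, "ab", n = 2)   # TRUE only for "abab" and "ab-ab"
```

Counting first and comparing afterwards keeps n as an ordinary argument instead of something pasted into the pattern.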
In case you are looking for only one character, you can use the solution from @wiktor-stribiżew
grepl("^[^a]*a[^a]*$", s)
# [1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
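The single-character pattern does not carry over directly to a longer substring such as "germ-pass" from the question. As a rough sketch, the gsub length check from the top of this answer could be applied to those file names like this; "other.tab" is a made-up extra name for illustration, and fixed = TRUE is an assumption that the substring is a literal string.

```r
# "other.tab" is a hypothetical file name added for illustration
files <- c("germ-pass.tab", "germ-pass_germ-pass.tab", "other.tab")
ss <- "germ-pass"

# Number of characters removed when every literal occurrence of ss is deleted
removed <- nchar(files) - nchar(gsub(ss, "", files, fixed = TRUE))

# Exactly one occurrence means exactly nchar(ss) characters were removed
once_only <- files[removed == nchar(ss)]
once_only
```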
Upvotes: 0
Reputation: 15395
It looks like you're after strings with one a and no more, regardless of where in the string. While stringi can accomplish the task, a base solution would be:
s <- c("a", "aa", "aca", "", "b", "ba", "ab")
m <- gregexpr("a", s)
s[lengths(regmatches(s, m)) == 1]
[1] "a" "ba" "ab"
Alternatively, a regex-lite approach:
s[vapply(strsplit(s, ""), function(x) sum(x == "a") == 1, logical(1))]
[1] "a" "ba" "ab"
Upvotes: 3
Reputation: 78792
We can use stringi::stri_count:
library(stringi)
library(purrr)
# simulate some data
set.seed(1492)
(map_chr(1:10, function(i) {
paste0(sample(letters, sample(10:30), replace=TRUE), collapse="")
}) -> strings)
## [1] "jdpcypoizdzvfzs" "gyvcljnfmrzmdmkufq"
## [3] "xqwrmnklbixnccwyaiadrsxn" "bwbenawcwvdevmjfvs"
## [5] "ytzwnpkuromfbklfsdnbwwnlrw" "wclxpzftqgwxyetpsuslgohcdenuj"
## [7] "czkhanefss" "mxsrqrackxvimcxqcqsditrou"
## [9] "ysqshvzjjmwes" "yzawyoqxqxiasensorlenafcbk"
# How many "w"s in each string?
stri_count_regex(strings, "w{1}")
## [1] 0 0 2 3 4 2 0 0 1 1
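To get from counts to the actual filter, the counts can be compared with 1. The following is a small sketch on the file names from the question rather than on the simulated strings; it assumes the stringi package is installed and uses stri_count_fixed so that the substring is treated literally.

```r
library(stringi)

files <- c("germ-pass.tab", "germ-pass_germ-pass.tab")

# Count literal (non-regex) occurrences of the substring in each name
n_hits <- stri_count_fixed(files, "germ-pass")

# Keep only the names where it occurs exactly once
once_only <- files[n_hits == 1]
once_only
```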
Upvotes: 2
Reputation: 626758
a but not aa
You can use the following TRE regex:
^[^a]*a[^a]*$
It matches the start of the string (^), 0+ chars other than a ([^a]*), an a, again 0+ non-'a's and the end of string ($). See this IDEONE demo:
a <- c("aca","cac","a", "abab", "ab-ab", "ab-cc-ab")
grep("^[^a]*a[^a]*$", a, value=TRUE)
## => [1] "cac" "a"
a but not aa
If you need to match words that contain exactly one a, and not two or more a's anywhere inside, use this PCRE regex:
\b(?!\w*a\w*a)\w*a\w*\b
See this regex demo.
Explanation:
\b - word boundary
(?!\w*a\w*a) - a negative lookahead failing the match if there are 0+ word chars, a, 0+ word chars and a again right after the word boundary
\w* - 0+ word chars
a - an a
\w* - 0+ word chars
\b - trailing word boundary.
NOTE: Since \w matches letters, digits and underscores, you might want to change it to \p{L} or [^\W\d_] (only matches letters).
See this demo:
a <- c("aca","cac","a")
grep("\\b(?!\\w*a\\w*a)\\w*a\\w*\\b", a, perl=TRUE, value=TRUE)
## => [1] "cac" "a"
Upvotes: 3
Reputation: 24074
As I said in comments, grep looks for a pattern inside your string and there is indeed "a" (or "a{1}", which is the same for grep) in "aa". You need to add to the pattern that the "a" is followed by something other than a: "a[^a]":
grep("a[^a]", c("aa", "ab"), value=TRUE)
#[1] "ab"
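A quick check of both patterns on a few extra inputs shows what each one does and does not guarantee. This is just an illustrative sketch; the vector x is made up for the example.

```r
x <- c("a", "aa", "ab", "aba")

# An unanchored "a{1}" only needs one "a" somewhere, so every string matches
unanchored <- grepl("a{1}", x)

# "a[^a]" needs an "a" followed by some other character, so it misses a lone
# "a" and still accepts "aba", which contains two a's
followed <- grepl("a[^a]", x)

unanchored
followed
```

So "a[^a]" separates "aa" from "ab", but on its own it is not an exactly-once test.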
EDIT
Considering your specific problem, it seems you can try the "opposite": filter out the strings that contain more than one occurrence of the pattern, using a "capture" of the pattern:
!grepl("(ab).+\\1", c("ab.t", "ab-ab.t"))
#[1] TRUE FALSE
!grepl("(ab).*\\1", c("ab", "ab-ab","ab-cc-ab", "abab"))
#[1] TRUE FALSE FALSE FALSE
The parentheses capture the pattern (here ab, but it can be any regex), the .* is for "anything" zero or more times, and the \\1 asks for a repeat of the text that was captured.
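Filtering out repeats alone also keeps strings that never contain the pattern, so for the file names in the question it may help to combine it with a plain presence check. A minimal sketch, where "other.tab" is a made-up extra name for illustration:

```r
# "other.tab" is a hypothetical file name added for illustration
files <- c("germ-pass.tab", "germ-pass_germ-pass.tab", "other.tab")

# Present at least once (literal match)
has_it <- grepl("germ-pass", files, fixed = TRUE)

# Present at least twice: the captured text shows up again later in the string
has_twice <- grepl("(germ-pass).*\\1", files)

keep <- files[has_it & !has_twice]
keep
```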
Upvotes: 3
Reputation: 887088
We can try with ^ and $ to make sure that there is only a single 'a' in the string
grep("^a$", a)
#[1] 1
It is not clear what the OP wanted.
Upvotes: 1