Reputation: 5167
I'm trying to count the instances of 3 consecutive "a" events, i.e. "aaa".
The string consists only of lowercase letters, e.g. "abaaaababaaa".
I tried the following piece of code. But the behavior is not precisely what I am looking for.
x<-"abaaaababaaa";
gregexpr("aaa",x);
I would like the match to return 3 instances of the "aaa" occurrence as opposed to 2.
Assume indexing begins at 1.
Upvotes: 5
Views: 4221
Reputation: 44614
Here is a way to extract all overlapping matches of varying length using gregexpr.
x<-"abaaaababaaa"
# nest in lookahead + capture group
# to get all instances of the pattern "(ab)|b"
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# regmatches will reference the match.length attr. to extract the strings
# so move match length data from 'capture.length' to 'match.length' attr
attr(matches[[1]], 'match.length') <- as.vector(attr(matches[[1]], 'capture.length')[,1])
# extract substrings
regmatches(x, matches)
# [[1]]
# [1] "ab" "b" "ab" "b" "ab" "b"
The trick is to surround the pattern in a capture group and that capture group in a lookahead assertion. gregexpr will return a list containing the start positions with an attribute capture.length, a matrix whose first column holds the match lengths of the first capture group. If you convert this into a vector and move it into the match.length attribute (which is all zeros, since the entire pattern was inside a lookahead assertion), you can pass it to regmatches to extract the strings.
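For the "aaa" pattern from the question, the same idea can be sketched as below. This is only an illustration under the assumption that the pattern has a fixed length of 3; the names m and starts are made up for the example, and the start positions are read directly from the gregexpr result instead of going through regmatches.

```r
x <- "abaaaababaaa"

# Put "aaa" in a capture group inside a lookahead: every match is zero-width,
# so the engine can report occurrences that overlap each other.
m <- gregexpr("(?=(aaa))", x, perl = TRUE)[[1]]

# Start positions of the overlapping occurrences
# (gregexpr returns -1 when there is no match, so drop that value).
starts <- as.vector(m)
starts <- starts[starts > 0]

length(starts)                    # 3
starts                            # 3 4 10
substring(x, starts, starts + 2)  # "aaa" "aaa" "aaa"
```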
As hinted by the type of the final result, with a few modifications, this can be vectorized for the case when x is a list of strings.
x<-list(s1="abaaaababaaa", s2="ab")
matches<-gregexpr('(?=((ab)|b))', x, perl=TRUE)
# make a function that replaces match.length attr with capture.length
set.match.length <- function(x) structure(x, match.length=as.vector(attr(x, 'capture.length')[,1]))
# set match.length to capture.length for each match object
matches<-lapply(matches, set.match.length)
# extract substrings
mapply(regmatches, x, lapply(matches, list))
# $s1
# [1] "ab" "b" "ab" "b" "ab" "b"
#
# $s2
# [1] "ab" "b"
Upvotes: 0
Reputation: 7928
I know I'm late, but I wanted to share this solution:
your.string <- "abaaaababaaa"
# number of 3-character windows; stopping at nchar - 2 keeps x[i+2] in range
nc1 <- nchar(your.string)-2
# split the string into single characters
x <- unlist(strsplit(your.string, NULL))
x2 <- c()
# build every window of 3 consecutive characters
for (i in 1:nc1)
  x2 <- c(x2, paste(x[i], x[i+1], x[i+2], sep=""))
cat("occurrences of <aaa> in <your.string> is,",
length(grep("aaa", x2)), "and they are at index", grep("aaa", x2))
> occurrences of <aaa> in <your.string> is, 3 and they are at index 3 4 10
Heavily inspired by this answer from R-help by Fran.
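The same sliding-window idea can probably be written without the loop, since substring is vectorized over its start and end positions. A minimal sketch, assuming the string has at least 3 characters; the names n, starts, windows and hits are invented for this example:

```r
your.string <- "abaaaababaaa"
n <- nchar(your.string)

# Start index of every 3-character window (assumes n >= 3).
starts <- seq_len(n - 2)

# All overlapping windows at once: "aba" "baa" "aaa" ...
windows <- substring(your.string, starts, starts + 2)

# Indices of the windows that are exactly "aaa".
hits <- which(windows == "aaa")

length(hits)  # 3
hits          # 3 4 10
```

Comparing with == instead of grep means a window has to equal "aaa" exactly, which is a stricter check than a substring match.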
Upvotes: 1
Reputation: 60080
To catch the overlapping matches, you can use a lookahead like this:
gregexpr("a(?=aa)", x, perl=TRUE)
However, your matches are now just a single "a", so it might complicate further processing of these matches, especially if you're not always looking for fixed-length patterns.
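For a fixed-length pattern such as "aaa", the count and the full substrings can still be recovered from the start positions. A rough sketch, assuming the pattern length is known to be 3; starts is just an illustrative name:

```r
x <- "abaaaababaaa"

# Each match consumes only one "a"; the lookahead checks the next two
# characters without consuming them, so overlapping occurrences are found.
m <- gregexpr("a(?=aa)", x, perl = TRUE)[[1]]

# gregexpr returns -1 when nothing matches, so keep positive positions only.
starts <- as.vector(m)
starts <- starts[starts > 0]

length(starts)                    # 3
starts                            # 3 4 10

# Rebuild the full 3-character matches from the start positions.
substring(x, starts, starts + 2)  # "aaa" "aaa" "aaa"
```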
Upvotes: 6