thebeancounter

Reputation: 4839

R grep by regex - finding a string that contains a substring exactly once

I am using R on Ubuntu and trying to go over a list of files; some of them I need and some I don't.

I try to pick out the ones I need by finding a substring in their names that must appear exactly once.

I am using the function grep, which I found here: grep function in r

and the regex rules that I found here: regex rules

Taking this simple example:

a <- c("a","aa") 
grep("a{1}", a) 

I would expect to get only the strings that contain "a" exactly one time, but instead I get both of them.

When I use 2 instead of 1, I do get the wanted result of one string (the one that contains "aa").

I can't use $ because the substring is not at the end of the names I need. For example, given the two names "germ-pass.tab" and "germ-pass_germ-pass.tab", I need to return only the first, which contains "germ-pass" once and only once.

I can't use ^a because I don't need words such as "aca".

Thanks.

Upvotes: 2

Views: 5291

Answers (6)

GKi

Reputation: 39657

In base R you can find strings that contain a substring exactly once by removing the substring with gsub and testing whether the string became shorter by exactly the length of the searched substring:

s <- c("a", "aa", "aca", "", "b", "ba", "ab", "cac", "abab", "ab-ab", NA)
ss  <- "a" #Substring to find exactly once

nchar(s) - nchar(gsub(ss, "", s)) == nchar(ss)
#[1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE    NA
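
The same length arithmetic can be turned into a count for a literal, multi-character substring such as the file-name fragment from the question. This is only a minimal sketch: the helper name count_fixed is hypothetical, the third file name is made up for illustration, and fixed = TRUE is used on the assumption that the substring should be treated as plain text rather than as a regex.

```r
# Hypothetical helper (not part of the original answer): count non-overlapping
# occurrences of a literal substring from the number of characters gsub removes.
count_fixed <- function(x, sub) {
  (nchar(x) - nchar(gsub(sub, "", x, fixed = TRUE))) / nchar(sub)
}

# The first two names come from the question; the third is an invented non-match.
files <- c("germ-pass.tab", "germ-pass_germ-pass.tab", "db-pass.tab")
counts <- count_fixed(files, "germ-pass")

# Keep only the names in which the substring occurs exactly once.
once <- files[counts == 1]
print(once)
```

Returning a count rather than a logical makes it easy to switch the final comparison to any other number of occurrences.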

or you count the hits of gregexpr

sapply(gregexpr(ss, s), function(x) sum(x>0)) == 1
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE    NA

or as @sebastian-c already mentioned

lengths(regmatches(s, gregexpr(ss, s))) == 1
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

or with two grepl calls, one asking whether the substring is present at least once and the other whether it is present at least twice:

!grepl("(.*a){2}", s) & grepl("a", s)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

or the same expressed in one regex, where (?!(.*a){2}) is a non-consuming (zero-width) negative lookahead:

grepl("^(?!(.*a){2}).*a.*$", s, perl=TRUE)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

or, more generally, in case you want to change it to find the substring exactly n times (use {n} for the positive test and {n+1} for the negated one):

!grepl("(.*a){2}", s) & grepl("(.*a){1}", s)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

grepl("^(?!(.*a){2})(.*a){1}.*$", s, perl=TRUE)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
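
To make the "exactly n times" idea concrete, the lookahead pattern can be built from the substring and n. The following is a sketch under assumptions: the function name exactly_n is hypothetical, the substring is wrapped in a non-capturing group so that multi-character patterns repeat as a unit, and sub is assumed to be a valid PCRE pattern already.

```r
# Hypothetical helper: TRUE where `sub` occurs exactly `n` times in `x`.
# The negative lookahead rejects n + 1 occurrences, the rest requires n.
exactly_n <- function(x, sub, n) {
  n <- as.integer(n)
  pattern <- sprintf("^(?!(?:.*(?:%s)){%d})(?:.*(?:%s)){%d}.*$",
                     sub, n + 1L, sub, n)
  grepl(pattern, x, perl = TRUE)
}

s <- c("a", "aa", "aca", "", "b", "ba", "ab", "cac", "abab", "ab-ab")

one_a  <- exactly_n(s, "a", 1)   # strings with exactly one "a"
two_ab <- exactly_n(s, "ab", 2)  # strings with exactly two "ab"
print(s[one_a])
print(s[two_ab])
```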

If you are looking for only a single character, you can use the solution from @wiktor-stribiżew:

grepl("^[^a]*a[^a]*$", s)
# [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

Upvotes: 0

sebastian-c

Reputation: 15395

It looks like you're after strings with one a and no more, regardless of where it is in the string. While stringi can accomplish the task, a base solution would be:

s <- c("a", "aa", "aca", "", "b", "ba", "ab")

m <- gregexpr("a", s)
s[lengths(regmatches(s, m)) == 1]

[1] "a"  "ba" "ab"

Alternatively, a regex-lite approach:

s[vapply(strsplit(s, ""), function(x) sum(x == "a") == 1, logical(1))]
[1] "a"  "ba" "ab"

Upvotes: 3

hrbrmstr

Reputation: 78792

We can use stringi::stri_count:

library(stringi)
library(purrr)

# simulate some data
set.seed(1492)
(map_chr(1:10, function(i) {
  paste0(sample(letters, sample(10:30), replace=TRUE), collapse="")
}) -> strings)

## [1] "jdpcypoizdzvfzs"               "gyvcljnfmrzmdmkufq"           
## [3] "xqwrmnklbixnccwyaiadrsxn"      "bwbenawcwvdevmjfvs"           
## [5] "ytzwnpkuromfbklfsdnbwwnlrw"    "wclxpzftqgwxyetpsuslgohcdenuj"
## [7] "czkhanefss"                    "mxsrqrackxvimcxqcqsditrou"    
## [9] "ysqshvzjjmwes"                 "yzawyoqxqxiasensorlenafcbk" 

# How many "w"s in each string?
stri_count_regex(strings, "w{1}")

## [1] 0 0 2 3 4 2 0 0 1 1
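
The counts can then be compared with 1 to keep only the strings that contain the substring exactly once. A small sketch applied to the file names from the question (the third name is invented for illustration); it uses stri_count_fixed on the assumption that the substring is literal text, and the requireNamespace guard is only there so the snippet does nothing on a machine without stringi.

```r
# The first two names come from the question; the third is an invented non-match.
files <- c("germ-pass.tab", "germ-pass_germ-pass.tab", "other.tab")

if (requireNamespace("stringi", quietly = TRUE)) {
  # Count literal (non-regex) occurrences of the substring in each name.
  counts <- stringi::stri_count_fixed(files, "germ-pass")
  # Keep the names where it occurs exactly once.
  once <- files[counts == 1]
  print(once)
}
```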

Upvotes: 2

Wiktor Stribiżew

Reputation: 626758

Detecting strings with a but not aa

You can use the following TRE regex:

^[^a]*a[^a]*$

It matches the start of the string (^), 0+ chars other than a ([^a]*), an a, again 0+ non-'a's and the end of string ($). See this IDEONE demo:

a <- c("aca","cac","a", "abab", "ab-ab", "ab-cc-ab")
grep("^[^a]*a[^a]*$", a, value=TRUE)
## => [1] "cac" "a"
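
The negated character class only works for a single character. For a multi-character substring, one possible generalisation (my assumption, not part of the demo above) is a PCRE pattern in which each position may consume a character only if the substring does not start there:

```r
x <- c("ab", "ab-ab", "ab-cc-ab", "abab", "cab.t")

# (?:(?!ab).)* consumes characters only while "ab" does not start at the
# current position, so the literal "ab" in the middle is the only occurrence.
once <- grepl("^(?:(?!ab).)*ab(?:(?!ab).)*$", x, perl = TRUE)
print(x[once])
```

This needs perl=TRUE, since lookaheads are not available in the default TRE engine.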

Finding Whole Word Containing a but not aa

If you need to match whole words that contain exactly one a (not two or more as anywhere in the word), use this PCRE regex:

\b(?!\w*a\w*a)\w*a\w*\b

See this regex demo.

Explanation:

  • \b - word boundary
  • (?!\w*a\w*a) - a negative lookahead failing the match if there are 0+ word chars, a, 0+ word chars and a again right after the word boundary
  • \w* - 0+ word chars
  • a - an a
  • \w* - 0+ word chars
  • \b - trailing word boundary.

NOTE: Since \w matches letters, digits and underscores, you might want to change it to \p{L} or [^\W\d_] (only matches letters).

See this demo:

a <- c("aca","cac","a")
grep("\\b(?!\\w*a\\w*a)\\w*a\\w*\\b", a, perl=TRUE, value=TRUE)
## => [1] "cac" "a"  
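
Because the pattern is anchored with word boundaries rather than ^ and $, it can also pull matching words out of a longer string. A minimal sketch, assuming an invented sample sentence and using base regmatches/gregexpr to extract the matches:

```r
# Invented sample text: only "cat" and "a" contain exactly one "a".
x <- "banana cat dog alpha a"

# Find every whole word with exactly one "a" and extract the matched text.
m <- gregexpr("\\b(?!\\w*a\\w*a)\\w*a\\w*\\b", x, perl = TRUE)
words <- regmatches(x, m)[[1]]
print(words)
```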

Upvotes: 3

Cath

Reputation: 24074

As I said in the comments, grep looks for a pattern anywhere inside your string, and there is indeed an "a" (or "a{1}", which is the same for grep) in "aa". You need to add to the pattern that the "a" is followed by something other than "a": "a[^a]":

grep("a[^a]", c("aa", "ab"), value=TRUE)
#[1] "ab"
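
Note that "a[^a]" only requires some "a" to be followed by a different character, so on its own it is not an exactly-once test. A small illustration with invented inputs:

```r
x <- c("a", "ba", "aab", "ab", "aa")

# "a" and "ba" contain one "a" but nothing follows it, so they do not match;
# "aab" contains two "a"s but still matches because of the trailing "ab".
res <- grepl("a[^a]", x)
print(res)
```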

EDIT

Considering your specific problem, it seems you can try the "opposite": filter out the strings that contain more than one occurrence of the pattern, using a "capture" of the pattern:

!grepl("(ab).+\\1", c("ab.t", "ab-ab.t"))
#[1]  TRUE FALSE

!grepl("(ab).*\\1", c("ab", "ab-ab","ab-cc-ab", "abab"))
#[1]  TRUE FALSE FALSE FALSE

The parentheses capture the pattern (here ab, but it can be any regex), the .* stands for "anything" zero or more times, and the \\1 asks for a repeat of the captured text.
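
The negated test alone is also TRUE for strings that do not contain the pattern at all, so for filtering file names it may need to be combined with a presence check. A sketch using the names from the question plus one invented non-matching name:

```r
# The first two names come from the question; the third is an invented non-match.
files <- c("germ-pass.tab", "germ-pass_germ-pass.tab", "db-fail.tab")

has_it   <- grepl("germ-pass", files, fixed = TRUE)  # occurs at least once
repeated <- grepl("(germ-pass).*\\1", files)          # occurs at least twice

# Exactly once = present, but not repeated.
once <- files[has_it & !repeated]
print(once)
```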

Upvotes: 3

akrun

Reputation: 887088

We can try with ^ and $ to make sure that the string consists of a single 'a' and nothing else:

grep("^a$", a)
#[1] 1

It is not clear what the OP wanted.

Upvotes: 1
