Reputation: 43
I would like to insert " & " between letters (upper-case and lower-case), but not before or after letters, and replace each lower-case letter x by tt$X==0, each upper-case letter X by tt$X==1, and each + by )|(, plus an opening bracket and a closing bracket around the entire string, so as to get an expression that can be evaluated in R. For example, I have the string
st <- "AbC + de + FGHIJ"
The result should then look like this:
"(tt$A==1 & tt$B==0 & tt$C==1) | (tt$D==0 & tt$E==0) | (tt$F==1 & tt$G==1 & tt$H==1 & tt$I==1 & tt$J==1)"
Could I easily do this with the gsub() function?
Upvotes: 4
Views: 187
Reputation: 6778
You can do this, but it's not very elegant:
st <- "AbC + de + FGHIJ"
t1 <- gsub("([a-z])", "tt\\$\\U\\1==0", st, perl = TRUE)
t2 <- gsub("((?<!\\$)[A-Z])", "tt\\$\\U\\1==1", t1, perl = TRUE)
t3 <- gsub("([0-9])(tt)", "\\1 & \\2", t2)
t4 <- gsub(" + ", ") | (", t3, fixed = TRUE)
t5 <- paste("(", t4, ")", sep = "")
st
# "AbC + de + FGHIJ"
t5
# "(tt$A==1 & tt$B==0 & tt$C==1) | (tt$D==0 & tt$E==0) | (tt$F==1 & tt$G==1 & tt$H==1 & tt$I==1 & tt$J==1)"
Here's an explanation of what it does:
t1 replaces each lower-case letter with tt$X==0, where X is the upper-case version of the letter that was replaced. The upper-case letter is produced with \\U\\1, where \\U converts what follows to upper case and \\1 returns the first capture group. A capture group is whatever is matched inside a pair of parentheses. Note that \\U is only available with perl = TRUE.
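As a minimal illustration of the two pieces on their own (the input "AbC" and the variable names are just for this sketch):

```r
# \\U upper-cases whatever follows it in the replacement (perl = TRUE only)
up <- gsub("([a-z])", "\\U\\1", "AbC", perl = TRUE)
up
# [1] "ABC"

# \\1 on its own re-inserts the captured letter unchanged
wrapped <- gsub("([a-z])", "<\\1>", "AbC")
wrapped
# [1] "A<b>C"
```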
Now that the lower-case letters are out of the way (this has to be done first so that we don't replace the letters in tt), we replace the capital letters, but only if they are not preceded by a $. To tell gsub to skip capital letters that follow a dollar sign, we use the negative lookbehind (?<!...), and the \\$ inside it is the escaped dollar sign. We then again replace the matched letter with its upper-case form, this time followed by ==1.
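A small sketch of the lookbehind in isolation, using a made-up input string so the effect is easy to see:

```r
x <- "tt$A==0 B"

# With the negative lookbehind, the A after the dollar sign is left alone
skipped <- gsub("(?<!\\$)[A-Z]", "_", x, perl = TRUE)
skipped
# [1] "tt$A==0 _"

# Without it, every capital letter is replaced
all_caps <- gsub("[A-Z]", "_", x)
all_caps
# [1] "tt$_==0 _"
```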
Next, we need to insert " & " between all of the terms we just created. The easiest way to do that is to notice that tt$ is preceded by a digit every time a separator is needed. So we look for a digit followed by "tt" and replace the pair with the first capture group, " & ", and the second capture group.
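For instance, applied to a single already-converted word (a hand-written intermediate string, not output copied from above):

```r
# Each "digit followed by tt" boundary gets " & " inserted between the two groups
joined <- gsub("([0-9])(tt)", "\\1 & \\2", "tt$A==1tt$B==0tt$C==1")
joined
# [1] "tt$A==1 & tt$B==0 & tt$C==1"
```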
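Then we need to replace the "+" symbols. We replace each one, together with the whitespace around it, with ") | (". We use fixed = TRUE so that the + in the pattern is matched literally and does not need to be escaped (the parentheses and the OR operator are in the replacement, where they need no escaping either way).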
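To see why fixed = TRUE matters for this pattern, here is a short comparison on a toy string:

```r
# Literal match: the pattern " + " is the three characters space, plus, space
literal <- gsub(" + ", ") | (", "a + b", fixed = TRUE)
literal
# [1] "a) | (b"

# As a regex, " + " means "one or more spaces, then a space", so nothing matches
as_regex <- gsub(" + ", ") | (", "a + b")
as_regex
# [1] "a + b"
```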
Lastly, we append the leading and trailing parentheses to give us a fully functioning conditional phrase.
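Since the goal in the question is an expression that can be evaluated, here is a rough sketch of how the resulting string might be used. The data frame below is made up purely for illustration, and eval(parse(text = ...)) is just one way to evaluate a string, so treat this as an assumption about the intended use rather than part of the solution:

```r
# Hypothetical data: columns and values are invented for this example
tt <- data.frame(A = c(1, 0), B = c(0, 0), C = c(1, 1),
                 D = c(1, 0), E = c(0, 1))

expr <- "(tt$A==1 & tt$B==0 & tt$C==1) | (tt$D==0 & tt$E==0)"

# Parse the string into R code and evaluate it; the result has one value per row
keep <- eval(parse(text = expr))
keep
# [1]  TRUE FALSE
```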
Per the comments made in the other solution, we can make a couple of changes to my proposed solution to make it a) more robust and b) more flexible. To make it more robust, we simply change t4 so that it is now:
t4 <- gsub(" ?\\+ ?", ") | (", t3)
We add question marks after the spaces to say that there can be zero or one of each, escape the +, and remove fixed = TRUE. We have to remove fixed = TRUE because the pattern is now a regular expression that has to match with or without the spaces.
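A quick check of the relaxed pattern against a few spacing variants (toy inputs, chosen only for this sketch):

```r
inputs <- c("a + b", "a+b", "a +b")

# " ?" allows an optional single space on either side of the escaped plus
out <- gsub(" ?\\+ ?", ") | (", inputs)
out
# [1] "a) | (b" "a) | (b" "a) | (b"
```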
To make it more flexible, we simply wrap it in a function that allows us to pass the string and our desired object name.
parse_string <- function(string, object_name) {
  st <- string
  t1 <- gsub("([a-z])", paste0(object_name, "\\$\\U\\1==0"), st, perl = TRUE)
  t2 <- gsub("((?<!\\$)[A-Z])", paste0(object_name, "\\$\\U\\1==1"), t1, perl = TRUE)
  t3 <- gsub(paste0("([0-9])(", object_name, ")"), "\\1 & \\2", t2)
  t4 <- gsub(" ?\\+ ?", ") | (", t3)
  t5 <- paste("(", t4, ")", sep = "")
  return(t5)
}
> parse_string(st, "tt") == t5
# [1] TRUE
> parse_string(st, "foo")
# [1] "(foo$A==1 & foo$B==0 & foo$C==1) | (foo$D==0 & foo$E==0) | (foo$F==1 & foo$G==1 & foo$H==1 & foo$I==1 & foo$J==1)"
> parse_string("AbC+de+FGHIJ", "tt") == t5
# [1] TRUE
Upvotes: 3
Reputation: 94222
A bunch of regexps are rarely elegant, and often hard to debug. The above regexp solution fails if there's not that exact spacing between elements.
> tt("aBc+b")
[1] "(tt$A==0 & tt$B==1 & tt$C==0+tt$B==0)"
> tt("aBc + b")
[1] "(tt$A==0 & tt$B==1 & tt$C==0) | (tt$B==0)"
Sometimes you just have to split the bits up yourself and process them. Here's a solution:
doChar = Vectorize(
  function(c){
    sprintf("tt$%s==%s", toupper(c), ifelse(c %in% LETTERS, "1", "0"))
  }
)
doWord = Vectorize(function(W){
  cs = strsplit(W, "")[[1]]
  paste0("(",
         paste(doChar(cs), collapse=" & "),
         ")")
})
processString = function(st){
  parts = strsplit(st, "\\+")[[1]]
  parts = gsub(" ", "", parts)
  paste0(doWord(parts), collapse=" | ")
}
There are probably many ways to make it better, but it has the benefit of being a bit easier to debug (you can test the parts) and looks less like line noise :)
For the sample string given it returns the same as the tt function, which is my function wrapper of the regexp solution:
> tt(st)==processString(st)
[1] TRUE
But it also handles different spacing:
> processString("aBc + deF") == processString("aBc+deF")
[1] TRUE
It's always a good idea to write code that is a bit flexible in the inputs it accepts. You might also notice that the tt part of the output elements appears only once, so if you want to output foo$A instead of tt$A there's only one change needed. The regexp solution has this in three places (or maybe four if I've missed one!).
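Building on that point, one possible way to turn the prefix into an argument is sketched below. The function name process_string and the parameter obj are hypothetical, and this is a compact variant of the split-and-process idea rather than the exact code above:

```r
# Hypothetical generalisation: `obj` is an assumed parameter for the object name
process_string <- function(st, obj = "tt") {
  # Split on "+" and strip all whitespace from each term
  terms <- gsub("\\s", "", strsplit(st, "+", fixed = TRUE)[[1]])
  words <- vapply(terms, function(term) {
    chars <- strsplit(term, "")[[1]]
    # Upper-case letters map to 1, lower-case letters to 0
    conds <- paste0(obj, "$", toupper(chars), "==", as.integer(chars %in% LETTERS))
    paste0("(", paste(conds, collapse = " & "), ")")
  }, character(1), USE.NAMES = FALSE)
  paste(words, collapse = " | ")
}

process_string("aBc+b", "foo")
# [1] "(foo$A==0 & foo$B==1 & foo$C==0) | (foo$B==0)"
```

The object name now appears in exactly one place, so switching from tt to foo is just a matter of passing a different argument.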
Upvotes: 0