Reputation: 136
So I want this character vector
x<-c("WB (16)","CT (14)WB (15)","NBIO (15)","CT (12)CITG-TP (17)","BK (11)PS (15)BK-AR (15)")
to look like this
x<-
WB
WB
NBIO
CITG-TP
BK-AR
So I want to extract the last (or only) value, where a value is a code together with its year (e.g. WB (15) is one value), and then remove the year along with its brackets. I tried doing this with sub(".*?)", "", x)
but when there is only one entry, that entry is cleared too, and the year is still left in the other strings, as shown here:
c( "", "WB (15)" , "" , "CITG-TP (17)","PS (15)BK-AR (15)")
How can I do this?
Upvotes: 0
Views: 103
Reputation: 8110
I strongly doubt this is the most efficient regex, but this gets you the exact output you're looking for:
library(stringr)
str_replace_all(x, "CT\\s\\(\\d+\\)|BK\\s\\(\\d+\\)|PS\\s\\(\\d+\\)|\\s\\(\\d+\\)","")
[1] "WB" "WB" "NBIO" "CITG-TP" "BK-AR"
I played around with it some more, and this also works:
str_replace_all(x, "\\s\\(\\d+\\)|CT|PS|BK(?=\\s)","")
[1] "WB" "WB" "NBIO" "CITG-TP" "BK-AR"
Here is a more general approach
strReverse <- function(x){
  sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
}
strReverse(str_extract(strReverse(x),"(?<=\\(\\s).*?(?=(\\)|$))"))
[1] "WB" "WB" "NBIO" "CITG-TP" "BK-AR"
I'm sure there is some way to select the last occurrence of a pattern directly, but I was having trouble with that, so I defined a function that reverses the string, took the first occurrence of the pattern in the reversed string, and then reversed the result to put it back in the correct order.
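For what it's worth, one way to pick out the last code without reversing anything is to anchor the match to the end of the string. This is only a sketch in base R, assuming every string ends in a code followed by " (digits)"; the name last_code is made up for illustration.

```r
# Same input as in the question.
x <- c("WB (16)", "CT (14)WB (15)", "NBIO (15)",
       "CT (12)CITG-TP (17)", "BK (11)PS (15)BK-AR (15)")

# [^)]+               a run of characters with no ")", so it cannot reach
#                     back into an earlier entry
# (?=\\s\\(\\d+\\)$)  lookahead: must be followed by " (digits)" at the very
#                     end of the string
# perl = TRUE is needed for the lookahead.
# Note: regmatches() silently drops elements that do not match at all.
last_code <- regmatches(x, regexpr("[^)]+(?=\\s\\(\\d+\\)$)", x, perl = TRUE))
last_code
# [1] "WB"      "WB"      "NBIO"    "CITG-TP" "BK-AR"
```

Because the lookahead ends in $, only the final entry can satisfy it, which is what makes this a "last occurrence" match.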
Upvotes: 1
Reputation: 79208
sub(".*?([^)]+)\\s\\(\\d+\\)$","\\1",x)
[1] "WB" "WB" "NBIO" "CITG-TP" "BK-AR"
Upvotes: 1
Reputation: 1001
This removes the numbers in parentheses and then selects the last code in each string. I'm using pipes (%>%) to keep the code cleaner.
library(magrittr) # pipe operators
newx <-
  x %>%
  gsub('[[:blank:]]\\([[:digit:]]*\\)', ';', .) %>% # change all " (NN)" to ";"
  strsplit(split = ';') %>% # split the strings into a list
  lapply(rev) %>% # reverse the order of each element
  lapply('[[', 1) %>% # select the first element (the original last code)
  unlist() # change back to a vector
> newx
[1] "WB" "WB" "NBIO" "CITG-TP" "BK-AR"
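If you would rather not load magrittr, the same steps can be written in plain base R. This is a rough equivalent rather than a better method; the intermediate names codes and parts are just illustrative, and it assumes, like the piped version, that no code contains a ";".

```r
# Same input as in the question.
x <- c("WB (16)", "CT (14)WB (15)", "NBIO (15)",
       "CT (12)CITG-TP (17)", "BK (11)PS (15)BK-AR (15)")

# Replace every " (NN)" with ";" so each string becomes "CODE;CODE;...".
codes <- gsub("[[:blank:]]\\([[:digit:]]*\\)", ";", x)

# Split on the literal ";" (strsplit drops the trailing empty piece).
parts <- strsplit(codes, ";", fixed = TRUE)

# Take the last piece of each string directly instead of reversing first.
newx <- vapply(parts, function(p) p[length(p)], character(1))
newx
# [1] "WB"      "WB"      "NBIO"    "CITG-TP" "BK-AR"
```

vapply() with character(1) also guards against a string that yields no pieces, since it errors instead of silently returning a list.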
Upvotes: 2