Leo

Reputation: 136

How to extract the last value from each string and remove the numbers?

So I want this dataframe/string/vector

 x<-c("WB (16)","CT (14)WB (15)","NBIO (15)","CT (12)CITG-TP (17)","BK (11)PS (15)BK-AR (15)")

to look like this

 x<-
    WB
    WB
    NBIO
    CITG-TP
    BK-AR

So I want to extract the last (or only) value, where a value is a code together with its year (e.g. WB (15) is one value), and then remove the year and its brackets. I tried sub(".*?)", "", x), but when a string has only one entry, that entry is removed too, as shown here:

c( "", "WB (15)" , "" , "CITG-TP (17)","PS (15)BK-AR (15)")

How can I do this?

Upvotes: 0

Views: 103

Answers (3)

AndS.

Reputation: 8110

I strongly doubt this is the most efficient regex, but this gets you the exact output you're looking for:

library(stringr)
str_replace_all(x, "CT\\s\\(\\d+\\)|BK\\s\\(\\d+\\)|PS\\s\\(\\d+\\)|\\s\\(\\d+\\)","")
[1] "WB"      "WB"      "NBIO"    "CITG-TP" "BK-AR" 

I played around with it some more, and this also seems to work:

str_replace_all(x, "\\s\\(\\d+\\)|CT|PS|BK(?=\\s)","")
[1] "WB"      "WB"      "NBIO"    "CITG-TP" "BK-AR" 

Here is a more general approach

strReverse <- function(x){
    sapply(lapply(strsplit(x, NULL), rev), paste, collapse="")
}

strReverse(str_extract(strReverse(x),"(?<=\\(\\s).*?(?=(\\)|$))"))
[1] "WB"      "WB"      "NBIO"    "CITG-TP" "BK-AR" 

I'm sure there is some way to select the last occurrence of a pattern, but I was having trouble with that, so I defined a function that reverses the string, took the first occurrence of the pattern, and then put the result back in the correct order.
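For what it's worth, one way to pick the last occurrence directly, without reversing anything, is to collect every match and keep the final one. This is only a minimal base-R sketch, assuming every string contains at least one code followed by a bracketed year; the names matches and last_code are just illustrative.

```r
x <- c("WB (16)", "CT (14)WB (15)", "NBIO (15)",
       "CT (12)CITG-TP (17)", "BK (11)PS (15)BK-AR (15)")

# Find every run of non-bracket, non-space characters that is
# immediately followed by " (digits)"; the lookahead needs perl = TRUE.
matches <- regmatches(x, gregexpr("[^()\\s]+(?=\\s\\(\\d+\\))", x, perl = TRUE))

# Keep only the last match of each string.
# Assumption: each element has at least one match, otherwise vapply() errors.
last_code <- vapply(matches, function(m) tail(m, 1), character(1))
last_code
# [1] "WB"      "WB"      "NBIO"    "CITG-TP" "BK-AR"
```

The year never matches on its own because it is followed by ")" rather than by a space and an opening bracket.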

Upvotes: 1

Onyambu

Reputation: 79208

 sub(".*?([^)]+)\\s\\(\\d+\\)$","\\1",x)
[1] "WB"      "WB"      "NBIO"    "CITG-TP" "BK-AR"  
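
To see why this works: the lazy .*? skips as little as it can, ([^)]+) captures the run of characters with no closing bracket in it, and \\s\\(\\d+\\)$ pins that run to the final year. The same idea can be written as two simpler substitutions. This is only a sketch of an equivalent approach; no_year and codes are illustrative names.

```r
x <- c("WB (16)", "CT (14)WB (15)", "NBIO (15)",
       "CT (12)CITG-TP (17)", "BK (11)PS (15)BK-AR (15)")

# Step 1: drop the trailing " (NN)" year from each string.
no_year <- sub("\\s\\(\\d+\\)$", "", x)

# Step 2: greedily drop everything up to and including the last ")" left over.
# Strings with a single entry contain no ")" at this point and stay unchanged.
codes <- sub(".*\\)", "", no_year)
codes
# [1] "WB"      "WB"      "NBIO"    "CITG-TP" "BK-AR"
```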

Upvotes: 1

Carlos Eduardo Lagosta

Reputation: 1001

This will remove the numbers in parentheses and then select the last code in each string. I'm using pipes (%>%) to keep the code cleaner.

library(magrittr)  # pipe operators
newx <- 
  x %>% 
  gsub('[[:blank:]]\\([[:digit:]]*\\)', ';', .) %>%  # change all " (NN)" to ";"
  strsplit(split = ';') %>%                          # split the strings into a list
  lapply(rev) %>%                                    # reverse the order
  lapply('[[', 1) %>%                                # select first element
  unlist()                                           # change back to vector

> newx
[1] "WB"      "WB"      "NBIO"    "CITG-TP" "BK-AR"  

Upvotes: 2
