zoe

Reputation: 311

Remove last characters of string if string starts with pattern

I have a column of strings from which I would like to remove everything after the last '.', like so:

ENST00000338167.9
ABCDE.42927.6
ENST00000265393.10
ABCDE.43577.3
ENST00000370826.3

I would like to remove the last '.' and everything after it for the 'ENST' entries only, e.g.:

ENST00000338167
ABCDE.42927.6
ENST00000265393
ABCDE.43577.3
ENST00000370826

I can strip the suffix from every entry with

function(x) sub("\\.[^.]*$", "", x)

but if I try

function(x) sub("ENST*\\.[^.]*$", "", x)

it isn't quite working, and I don't fully understand the regex syntax.

Upvotes: 1

Views: 477

Answers (4)

Saurabh Chauhan

Reputation: 3221

We can use a combination of startsWith and sub:

Data:

df <- read.table(text = "ENST00000338167.9
  ABCDE.42927.6
  ENST00000265393.10
  ABCDE.43577.3
  ENST00000370826.3", header = FALSE)


# If the string starts with "ENST", remove the first dot and everything
# after it; otherwise return the string unchanged.
x <- as.character(df[, 1])
ifelse(startsWith(x, "ENST"), sub("\\..*", "", x), x)

Output:

[1] "ENST00000338167" "ABCDE.42927.6"   "ENST00000265393" "ABCDE.43577.3"   "ENST00000370826"
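Since the question wraps the substitution in function(x), the same idea can be sketched in that shape. This is only a minimal sketch: the name strip_enst_version is made up for illustration, and it uses the question's own "\\.[^.]*$" pattern so that only the last dot and what follows it are removed.

```r
# Hypothetical helper (the name is not from the question or any package).
# Removes the last "." and everything after it, but only for strings
# that start with "ENST"; all other strings are returned unchanged.
strip_enst_version <- function(x) {
  x <- as.character(x)  # also accepts factor columns
  ifelse(startsWith(x, "ENST"), sub("\\.[^.]*$", "", x), x)
}

strip_enst_version(c("ENST00000338167.9", "ABCDE.42927.6"))
# [1] "ENST00000338167" "ABCDE.42927.6"
```

It can then be applied to a column directly, e.g. strip_enst_version(df$V1).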

Upvotes: 0

akrun

Reputation: 887951

We can use data.table, specifying the logical condition in i and updating the column by reference in j:

library(data.table)
setDT(df)[grepl("^ENST", Col1), Col1 := sub("\\.[^.]+$", "", Col1)]
df
#             Col1
#1: ENST00000338167
#2:   ABCDE.42927.6
#3: ENST00000265393
#4:   ABCDE.43577.3
#5: ENST00000370826

data

df <- structure(list(Col1 = c("ENST00000338167.9", "ABCDE.42927.6", 
"ENST00000265393.10", "ABCDE.43577.3", "ENST00000370826.3")), row.names = c(NA, 
 -5L), class = "data.frame")

Upvotes: 0

Maurits Evers

Reputation: 50738

We can use a capture group inside a single gsub call

gsub("(^ENST\\d+)\\.\\d+", "\\1", df[, 1])
#[1] "ENST00000338167" "ABCDE.42927.6"   "ENST00000265393" "ABCDE.43577.3"
#[5] "ENST00000370826"
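As for why the pattern in the question does not match: in "ENST*" the * applies only to the preceding T, so the regex looks for "ENS", zero or more "T"s, and then a literal dot straight away. The digits between "ENST" and the dot are never accounted for. A minimal sketch of that, plus a variant that captures everything up to the last dot, is below; the vector name ids is only for illustration.

```r
# Illustrative input; `ids` is a hypothetical name, not from the question.
ids <- c("ENST00000338167.9", "ABCDE.42927.6", "ENST00000265393.10")

# The question's pattern: "ENST*" must be followed directly by a dot,
# so nothing matches and the strings come back unchanged.
unchanged <- sub("ENST*\\.[^.]*$", "", ids)

# Anchor at the start, capture "ENST" plus everything up to the last dot,
# and keep only the captured part.
res <- sub("^(ENST.*)\\.[^.]*$", "\\1", ids)
res
# [1] "ENST00000338167" "ABCDE.42927.6"   "ENST00000265393"
```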

Sample data

df <- read.table(text =
    "ENST00000338167.9
ABCDE.42927.6
ENST00000265393.10
ABCDE.43577.3
ENST00000370826.3", header = FALSE)

Upvotes: 2

Ronak Shah

Reputation: 389325

We can use a combination of ifelse, grepl and sub. We first check whether the string starts with "ENST" and, if it does, remove the first "." and everything after it using sub.

ifelse(grepl("^ENST", x), sub("\\..*", "", x), x)

#[1] "ENST00000338167" "ABCDE.42927.6"   "ENST00000265393" "ABCDE.43577.3"  
#[5] "ENST00000370826"
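Note that "\\..*" cuts from the first dot, whereas the question's "\\.[^.]*$" cuts from the last dot only. For the sample data the two agree, because each ENST entry contains a single dot. A small sketch of the difference, using a made-up ID with two dots (not part of the question's data):

```r
y <- "ENST00000338167.9.1"   # hypothetical ID with two dots

first_dot <- sub("\\..*", "", y)       # strips from the first dot onward
last_dot  <- sub("\\.[^.]*$", "", y)   # strips the last dot and what follows

first_dot
# [1] "ENST00000338167"
last_dot
# [1] "ENST00000338167.9"
```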

data

x <- c("ENST00000338167.9","ABCDE.42927.6","ENST00000265393.10",
       "ABCDE.43577.3","ENST00000370826.3")

Upvotes: 3
