Lin Caijin

Reputation: 599

Extract letters and numbers in specific positions from a mixed string

df
   Chromosome aaChange
1          16 p.E548fs
2          16   p.S64X
3          16   p.P23H
4          16   p.G18V
5          16  p.L251S

I want to extract the third character (the letter after "p.") and the digits that follow it. Below is the output I want.

   Chromosome aaChange Protein_position
 1         16 p.E548fs             E548
 2         16   p.S64X              S64
 3         16   p.P23H              P23
 4         16   p.G18V              G18
 5         16  p.L251S             L251

Thanks.

Upvotes: 0

Views: 283

Answers (3)

Chris Ruehlemann

Reputation: 21400

The pattern you want to match is straightforward: a capital letter followed immediately by one or more digits. Written as an R string, that is "[A-Z]\\d+". We can pass it to str_extract:

library(stringr)
df$Protein_position <- str_extract(df$aaChange, "[A-Z]\\d+")
  Chromosome aaChange Protein_position
1         16 p.E548fs             E548
2         16   p.S64X              S64
3         16   p.P23H              P23
4         16   p.G18V              G18
5         16  p.L251S             L251
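
If stringr is not available, the same pattern can be applied in base R. This is a minimal sketch, assuming every value contains a match; the vector name aaChange is just the column from the question copied out for illustration. Note that regmatches() silently drops elements without a match, whereas str_extract() returns NA for them.

```r
# Sample values copied from the question
aaChange <- c("p.E548fs", "p.S64X", "p.P23H", "p.G18V", "p.L251S")

# regexpr() locates the first match in each string;
# regmatches() extracts the matched text at those locations
m <- regexpr("[A-Z][0-9]+", aaChange)
Protein_position <- regmatches(aaChange, m)

print(Protein_position)
```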

Upvotes: 1

akrun

Reputation: 887128

With the tidyverse (dplyr and stringr):

library(dplyr)
library(stringr)
df %>%
   mutate(Protein_position = str_replace(aaChange,
      '^[^.]+\\.(.*\\d)[^0-9]+$', '\\1'))

-output

#  Chromosome aaChange Protein_position
#1         16 p.E548fs             E548
#2         16   p.S64X              S64
#3         16   p.P23H              P23
#4         16   p.G18V              G18
#5         16  p.L251S             L251

data

df <- structure(list(Chromosome = c(16L, 16L, 16L, 16L, 16L), 
aaChange = c("p.E548fs", "p.S64X", "p.P23H", "p.G18V", "p.L251S")), 
class = "data.frame", row.names = c(NA, -5L))
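
One thing to watch in a pattern of this shape is greediness: a bare (.*) capture takes as much as it can and leaves only a single trailing character for [^0-9]+, so a two-letter suffix such as "fs" keeps its first letter. The sketch below illustrates this with base R's sub() and perl = TRUE (an assumption made here so the example needs no packages; the answer itself uses stringr). The names x, greedy and anchored are hypothetical.

```r
# Sample values copied from the question
x <- c("p.E548fs", "p.S64X", "p.P23H", "p.G18V", "p.L251S")

# Bare (.*) is greedy: only the last non-digit is left for [^0-9]+
greedy <- sub("^[^.]+\\.(.*)[^0-9]+$", "\\1", x, perl = TRUE)

# Forcing the capture to end on a digit stops it before the suffix
anchored <- sub("^[^.]+\\.(.*\\d)[^0-9]+$", "\\1", x, perl = TRUE)

print(greedy)
print(anchored)
```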

Upvotes: 0

Ronak Shah

Reputation: 388982

You can do this with sub in base R:

transform(df, Protein_position = sub('..(.\\d+).*', '\\1', aaChange))

#  Chromosome aaChange Protein_position
#1         16 p.E548fs             E548
#2         16   p.S64X              S64
#3         16   p.P23H              P23
#4         16   p.G18V              G18
#5         16  p.L251S             L251

data

df <- structure(list(Chromosome = c(16L, 16L, 16L, 16L, 16L), 
aaChange = c("p.E548fs", "p.S64X", "p.P23H", "p.G18V", "p.L251S")), 
class = "data.frame", row.names = c(NA, -5L))
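
A caveat with the sub approach: when the pattern does not match, sub returns the input string unchanged rather than NA. The sketch below shows one way to guard against that; the value "unknown" and the names x, pat, raw and safe are made up for illustration and do not come from the question's data.

```r
# Two values from the question plus a hypothetical non-matching entry
x <- c("p.E548fs", "p.S64X", "unknown")
pat <- "..(.\\d+).*"

# Without a guard, the non-matching string passes through untouched
raw <- sub(pat, "\\1", x)

# With a guard, non-matching entries become NA instead
safe <- ifelse(grepl(pat, x), sub(pat, "\\1", x), NA_character_)

print(raw)
print(safe)
```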

Upvotes: 1
