Reputation: 599
df
Chromosome aaChange
1 16 p.E548fs
2 16 p.S64X
3 16 p.P23H
4 16 p.G18V
5 16 p.L251S
I want to extract the third character (the letter after "p.") together with the digits that follow it. Below is the output I want.
Chromosome aaChange Protein_position
1 16 p.E548fs E548
2 16 p.S64X S64
3 16 p.P23H P23
4 16 p.G18V G18
5 16 p.L251S L251
Thanks.
Upvotes: 0
Views: 283
Reputation: 21400
The pattern you want to match is straightforward: a capital letter followed immediately by one or more digits. This gives the pattern [A-Z]\\d+, which we can pass to str_extract:
library(stringr)
df$Protein_position <- str_extract(df$aaChange, "[A-Z]\\d+")
Chromosome aaChange Protein_position
1 16 p.E548fs E548
2 16 p.S64X S64
3 16 p.P23H P23
4 16 p.G18V G18
5 16 p.L251S L251
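To see how the pattern behaves on its own, here is a minimal self-contained sketch on a plain character vector. The sixth value, "p.?", is a hypothetical entry added only to illustrate a string with no match; it is not part of the question's data. For such a value str_extract returns NA.

```r
library(stringr)

# Sample values from the question, plus one hypothetical value with no match
aa <- c("p.E548fs", "p.S64X", "p.P23H", "p.G18V", "p.L251S", "p.?")

# [A-Z] matches one capital letter; \\d+ matches one or more digits right after it
pos <- str_extract(aa, "[A-Z]\\d+")
pos
# "E548" "S64" "P23" "G18" "L251" NA
```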
Upvotes: 1
Reputation: 887128
With tidyverse:
library(dplyr)
library(stringr)
df %>%
  mutate(Protein_position = str_replace(aaChange,
    '^[^.]+\\.(.*\\d)[^0-9]+$', '\\1'))
-output
# Chromosome aaChange Protein_position
#1 16 p.E548fs E548
#2 16 p.S64X S64
#3 16 p.P23H P23
#4 16 p.G18V G18
#5 16 p.L251S L251
data
df <- structure(list(Chromosome = c(16L, 16L, 16L, 16L, 16L),
  aaChange = c("p.E548fs", "p.S64X", "p.P23H", "p.G18V", "p.L251S")),
  class = "data.frame", row.names = c(NA, -5L))
Upvotes: 0
Reputation: 388982
You can do this with sub in base R:
transform(df, Protein_position = sub('..(.\\d+).*', '\\1', aaChange))
# Chromosome aaChange Protein_position
#1 16 p.E548fs E548
#2 16 p.S64X S64
#3 16 p.P23H P23
#4 16 p.G18V G18
#5 16 p.L251S L251
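One thing to keep in mind with sub is that a string that does not match the pattern is returned unchanged. The sketch below illustrates this with a hypothetical non-matching value, "p.?", which is not part of the question's data, and shows one possible guard (an assumption about what you might want, not part of the answer above) that returns NA instead.

```r
# Two sample values from the question, plus one hypothetical value with no match
aa <- c("p.E548fs", "p.S64X", "p.?")

pattern <- "..(.\\d+).*"

# sub() leaves non-matching strings untouched
res_sub <- sub(pattern, "\\1", aa)
res_sub
# "E548" "S64" "p.?"

# Hypothetical guard: keep the extracted value only where the pattern matches
res_na <- ifelse(grepl(pattern, aa), sub(pattern, "\\1", aa), NA_character_)
res_na
# "E548" "S64" NA
```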
data
df <- structure(list(Chromosome = c(16L, 16L, 16L, 16L, 16L),
aaChange = c("p.E548fs", "p.S64X", "p.P23H", "p.G18V", "p.L251S")),
class = "data.frame", row.names = c(NA, -5L))
Upvotes: 1