jonny jeep

Reputation: 417

Extract word from string and create new column in R

My data looks like this:

try=data.frame("histones"= c("encode3Ren_limb_H3K27me3_E10","encode3Ren_facial_prominence_H3K27me3_E10", "encode3Ren_liver_H3K27me3_E12", "encode3Ren_neural_tube_H3K27me3_E14", "encode3Ren_neural_tube_H3K4me1_E12" ,"encode3Ren_neural_tube_H3K27me3_E11", "encode3Ren_neural_tube_H3K4me1_E15", "encode3Ren_neural_tube_H3K4me2_E13" ), "a"= c(1,2,3,4,5,6,7,8))

try
                                   histones a
1              encode3Ren_limb_H3K27me3_E10 1
2 encode3Ren_facial_prominence_H3K27me3_E10 2
3             encode3Ren_liver_H3K27me3_E12 3
4       encode3Ren_neural_tube_H3K27me3_E14 4
5        encode3Ren_neural_tube_H3K4me1_E12 5
6       encode3Ren_neural_tube_H3K27me3_E11 6
7        encode3Ren_neural_tube_H3K4me1_E15 7
8        encode3Ren_neural_tube_H3K4me2_E13 8

I would like to extract only the histone mark (e.g. H3K27me3, H3K4me2) from the column "histones" and put it in a new column. I don't know how to use regular expressions, so any help is much appreciated.

Upvotes: 0

Views: 41

Answers (3)

Andre Wildberg

Reputation: 19191

A base R option using gsub

cbind(try, mod = gsub(".*_([H\\d+])|_[Ee]\\d+$", "\\1", try$histones))
                                   histones a      mod
1              encode3Ren_limb_H3K27me3_E10 1 H3K27me3
2 encode3Ren_facial_prominence_H3K27me3_E10 2 H3K27me3
3             encode3Ren_liver_H3K27me3_E12 3 H3K27me3
4       encode3Ren_neural_tube_H3K27me3_E14 4 H3K27me3
5        encode3Ren_neural_tube_H3K4me1_E12 5  H3K4me1
6       encode3Ren_neural_tube_H3K27me3_E11 6 H3K27me3
7        encode3Ren_neural_tube_H3K4me1_E15 7  H3K4me1
8        encode3Ren_neural_tube_H3K4me2_E13 8  H3K4me2
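
The same idea can also be written with a single capture group, which may be easier to read. This is only a sketch, offered as an alternative and not a correction. It assumes every string ends in a stage suffix such as "_E10" and that the mark is the field just before it and starts with "H". The data frame is a shortened copy of the one in the question.

```r
# Sample data from the question (shortened to three rows)
try <- data.frame(
  histones = c("encode3Ren_limb_H3K27me3_E10",
               "encode3Ren_facial_prominence_H3K27me3_E10",
               "encode3Ren_neural_tube_H3K4me2_E13"),
  a = 1:3
)

# ".*_"          everything up to an underscore
# "(H[^_]+)"     capture group: "H" followed by non-underscore characters
# "_[Ee]\\d+$"   the trailing stage suffix, e.g. "_E10"
try$mod <- sub(".*_(H[^_]+)_[Ee]\\d+$", "\\1", try$histones)
try$mod
```

Note that sub() returns a string unchanged when the pattern does not match, so unexpected inputs would pass through as-is.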

Upvotes: 1

jkatam

Reputation: 3447

You can use str_extract() from stringr:

library(dplyr)
library(stringr)

try %>% mutate(hist=str_extract(histones, '\\w\\d\\w\\d+.*\\d(?=\\_)'))

Created on 2023-01-21 with reprex v2.0.2

                                   histones a     hist
1              encode3Ren_limb_H3K27me3_E10 1 H3K27me3
2 encode3Ren_facial_prominence_H3K27me3_E10 2 H3K27me3
3             encode3Ren_liver_H3K27me3_E12 3 H3K27me3
4       encode3Ren_neural_tube_H3K27me3_E14 4 H3K27me3
5        encode3Ren_neural_tube_H3K4me1_E12 5  H3K4me1
6       encode3Ren_neural_tube_H3K27me3_E11 6 H3K27me3
7        encode3Ren_neural_tube_H3K4me1_E15 7  H3K4me1
8        encode3Ren_neural_tube_H3K4me2_E13 8  H3K4me2

Upvotes: 1

Tim Biegeleisen

Reputation: 522699

Actually, regular expressions are a good choice here, for example with str_extract() from the stringr package:

library(stringr)
try$mark <- str_extract(try$histones, "(?<=_)H\\d+K\\d+\\w+?(?=_)")
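
If stringr is not available, roughly the same extraction can be sketched in base R with regexpr() and regmatches(), using perl = TRUE so that the lookarounds are supported. The vector x and the non-matching string "no_mark_here" are made up for illustration. The NA fill is an assumption about how unmatched rows should be handled.

```r
x <- c("encode3Ren_limb_H3K27me3_E10",
       "encode3Ren_neural_tube_H3K4me1_E12",
       "no_mark_here")

# regexpr() returns the position of the first match, or -1 if there is none
m <- regexpr("(?<=_)H\\d+K\\d+\\w+?(?=_)", x, perl = TRUE)

# Start with NA everywhere, then fill in only the elements that matched
mark <- rep(NA_character_, length(x))
mark[m != -1] <- regmatches(x, m)
mark
```

regmatches() on its own drops elements without a match, which is why the result is written into a pre-filled NA vector to keep it aligned with the input.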

If you really can't use regex for some reason, here is an option using base R string functions:

x <- "encode3Ren_facial_prominence_H3K27me3_E10"
mark <- tail(unlist(strsplit(x, "_")), 2)[-2]
mark

[1] "H3K27me3"
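
The snippet above handles one string. To fill a whole column, the same split-and-pick idea can be applied to every element. A minimal sketch, assuming each string has at least two underscore-separated fields and the mark is always the second-to-last one (the vector histones is a shortened copy of the question's data):

```r
histones <- c("encode3Ren_limb_H3K27me3_E10",
              "encode3Ren_facial_prominence_H3K27me3_E10",
              "encode3Ren_neural_tube_H3K4me1_E12")

# Split each string on a literal underscore; fixed = TRUE avoids regex entirely
parts <- strsplit(histones, "_", fixed = TRUE)

# Take the second-to-last piece of each split string
mark <- vapply(parts, function(p) p[length(p) - 1], character(1))
mark
```

With the data frame from the question this would be assigned as a new column, e.g. try$mark; if histones is stored as a factor, wrap it in as.character() first, since strsplit() needs a character vector.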

Upvotes: 0
