Reputation: 704
I need to split the values in the column RefSeq at each _ that occurs before NM, without splitting at the _ between NM and the number.
I need the output to be in a new column of my input.
I tried something like:
strsplit(as.character(TargetScan$RefSeq),"_")
Data:
> head(TargetScan)
Gene miRNA Site cont.score cont.score.perc
1 A1CF hsa-let-7a-5p 8mer-1a -0.051 12
2 A1CF hsa-let-7b-5p 8mer-1a -0.051 12
3 A1CF hsa-let-7c-5p 8mer-1a -0.051 12
4 A1CF hsa-let-7d-5p 8mer-1a -0.062 12
5 A1CF hsa-let-7e-5p 8mer-1a -0.051 12
6 A1CF hsa-let-7f-5p 8mer-1a -0.051 12
RefSeq
1 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
2 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
3 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
4 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
5 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
6 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
Desired output:
> head(TargetScan)
Gene miRNA Site cont.score cont.score.perc
1 A1CF hsa-let-7a-5p 8mer-1a -0.051 12
2 A1CF hsa-let-7b-5p 8mer-1a -0.051 12
3 A1CF hsa-let-7c-5p 8mer-1a -0.051 12
4 A1CF hsa-let-7d-5p 8mer-1a -0.062 12
5 A1CF hsa-let-7e-5p 8mer-1a -0.051 12
6 A1CF hsa-let-7f-5p 8mer-1a -0.051 12
new1 new2 new3 new4 new5 new6
1 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
2 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
3 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
4 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
5 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
6 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
Upvotes: 0
Views: 107
Reputation: 28441
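# x is a single RefSeq value taken from the question's data
x <- "NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933"
strsplit(x, "(?<=\\d)_", perl = TRUE)[[1]]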
#[1] "NM_001198820" "NM_014576" "NM_138932" "NM_001198819"
#[5] "NM_001198818" "NM_138933"
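This approach uses a look-behind. The pattern "(?<=\\d)_" matches an underscore only when it is preceded by a digit, so the underscore right after NM is left alone. perl = TRUE is needed because base R's default regex engine does not support look-behind.
To get the desired output as new columns, use tidyr::separate with the same pattern: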
library(tidyr)
separate(TargetScan, RefSeq, paste0("new", 1:6), "(?<=\\d)_")
# Gene miRNA Site cont.score cont.score.perc new1 new2
# 1 A1CF hsa-let-7a-5p 8mer-1a -0.051 12 NM_001198820 NM_014576
# 2 A1CF hsa-let-7b-5p 8mer-1a -0.051 12 NM_001198820 NM_014576
# 3 A1CF hsa-let-7c-5p 8mer-1a -0.051 12 NM_001198820 NM_014576
# 4 A1CF hsa-let-7d-5p 8mer-1a -0.062 12 NM_001198820 NM_014576
# 5 A1CF hsa-let-7e-5p 8mer-1a -0.051 12 NM_001198820 NM_014576
# 6 A1CF hsa-let-7f-5p 8mer-1a -0.051 12 NM_001198820 NM_014576
# new3 new4 new5 new6
# 1 NM_138932 NM_001198819 NM_001198818 NM_138933
# 2 NM_138932 NM_001198819 NM_001198818 NM_138933
# 3 NM_138932 NM_001198819 NM_001198818 NM_138933
# 4 NM_138932 NM_001198819 NM_001198818 NM_138933
# 5 NM_138932 NM_001198819 NM_001198818 NM_138933
# 6 NM_138932 NM_001198819 NM_001198818 NM_138933
Upvotes: 3
Reputation: 531
Use a regular expression to match the text you want. To do that, I suggest stringr::str_match_all.
library(stringr)
s <- c('NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933',
'NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933')
str_match_all(s, '([A-Za-z]{2}_\\d+)_?')
yields
[[1]]
[,1] [,2]
[1,] "NM_001198820_" "NM_001198820"
[2,] "NM_014576_" "NM_014576"
[3,] "NM_138932_" "NM_138932"
[4,] "NM_001198819_" "NM_001198819"
[5,] "NM_001198818_" "NM_001198818"
[6,] "NM_138933" "NM_138933"
[[2]]
[,1] [,2]
[1,] "NM_001198820_" "NM_001198820"
[2,] "NM_014576_" "NM_014576"
[3,] "NM_138932_" "NM_138932"
[4,] "NM_001198819_" "NM_001198819"
[5,] "NM_001198818_" "NM_001198818"
[6,] "NM_138933" "NM_138933"
After that you can arrange the returned list into a data.frame. Note that the second column of each matrix (the capture group) holds the IDs you want; the first column is the full match, including the trailing underscore.
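One possible way to do that last step is sketched below. It assumes every string contains the same number of IDs, so the pieces stack into a rectangular matrix; the names matches, ids and new_cols and the shortened example strings are illustrative, not from the question.

```r
library(stringr)

# Shortened example strings (three IDs each) to keep the output small.
s <- c("NM_001198820_NM_014576_NM_138932",
       "NM_001198819_NM_001198818_NM_138933")

# One matrix per input string; column 2 is the capture group.
matches <- str_match_all(s, "([A-Za-z]{2}_\\d+)_?")

# Pull column 2 from each matrix and stack the results row-wise.
# Assumption: every string yields the same number of matches.
ids <- do.call(rbind, lapply(matches, function(m) m[, 2]))

new_cols <- as.data.frame(ids, stringsAsFactors = FALSE)
names(new_cols) <- paste0("new", seq_len(ncol(new_cols)))
new_cols
```

new_cols can then be bound to the original data with cbind. If the number of IDs varies between rows, the rbind step would need padding with NA first.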
Upvotes: 0
Reputation: 733
I would replace the underscore before NM using gsub and then call strsplit on the values, something like this:
strsplit(gsub('_NM', ',NM', s), ',')
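To turn that list into new columns of the data frame, as the question asks, a base R sketch could look like the following. The data frame df is a small hypothetical stand-in for TargetScan, and the code assumes that every row has the same number of IDs and that the values contain no commas.

```r
# Toy stand-in for TargetScan (hypothetical, only two columns).
df <- data.frame(
  Gene = c("A1CF", "A1CF"),
  RefSeq = c("NM_001198820_NM_014576_NM_138932",
             "NM_001198820_NM_014576_NM_138932"),
  stringsAsFactors = FALSE
)

# Mark each separator with a comma, then split on the comma.
parts <- strsplit(gsub("_NM", ",NM", df$RefSeq), ",")

# Stack the pieces into a matrix, one row per original row.
# Assumption: the same number of IDs in every row.
mat <- do.call(rbind, parts)
colnames(mat) <- paste0("new", seq_len(ncol(mat)))

# Combine the remaining original columns with the new ones.
out <- data.frame(df[setdiff(names(df), "RefSeq")], mat,
                  stringsAsFactors = FALSE)
out
```

Keeping RefSeq in the result is a matter of passing df instead of the subset in the last step.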
Upvotes: 0