BioMan

Reputation: 704

String split in R in specific context

I need to split the values in the RefSeq column at each _ that occurs before NM, without splitting at the _ between NM and the number. The pieces should end up in new columns of my input data frame.

I tried something like this, but it splits at every underscore:

strsplit(as.character(TargetScan$RefSeq),"_")

Data:

> head(TargetScan)
  Gene         miRNA    Site cont.score cont.score.perc
1 A1CF hsa-let-7a-5p 8mer-1a     -0.051              12
2 A1CF hsa-let-7b-5p 8mer-1a     -0.051              12
3 A1CF hsa-let-7c-5p 8mer-1a     -0.051              12
4 A1CF hsa-let-7d-5p 8mer-1a     -0.062              12
5 A1CF hsa-let-7e-5p 8mer-1a     -0.051              12
6 A1CF hsa-let-7f-5p 8mer-1a     -0.051              12
                                                                RefSeq
1 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
2 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
3 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
4 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
5 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933
6 NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933

Desired output:

> head(TargetScan)
  Gene         miRNA    Site cont.score cont.score.perc
1 A1CF hsa-let-7a-5p 8mer-1a     -0.051              12
2 A1CF hsa-let-7b-5p 8mer-1a     -0.051              12
3 A1CF hsa-let-7c-5p 8mer-1a     -0.051              12
4 A1CF hsa-let-7d-5p 8mer-1a     -0.062              12
5 A1CF hsa-let-7e-5p 8mer-1a     -0.051              12
6 A1CF hsa-let-7f-5p 8mer-1a     -0.051              12
          new1      new2      new3         new4         new5      new6
1 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
2 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
3 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
4 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
5 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933
6 NM_001198820 NM_014576 NM_138932 NM_001198819 NM_001198818 NM_138933

Upvotes: 0

Views: 107

Answers (3)

Pierre L

Reputation: 28441

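# x is one value from the RefSeq column
x <- "NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933"
strsplit(x, "(?<=\\d)_", perl = TRUE)[[1]]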
#[1] "NM_001198820" "NM_014576"    "NM_138932"    "NM_001198819"
#[5] "NM_001198818" "NM_138933"  

This approach uses a look-behind: the pattern "(?<=\\d)_" matches an underscore only when it is preceded by a digit, so the underscore that follows NM is left alone.

To get the desired output as new columns, pass the same pattern to tidyr::separate:

library(tidyr)
separate(TargetScan, RefSeq, paste0("new", 1:6), "(?<=\\d)_")
#   Gene         miRNA    Site cont.score cont.score.perc         new1      new2
# 1 A1CF hsa-let-7a-5p 8mer-1a     -0.051              12 NM_001198820 NM_014576
# 2 A1CF hsa-let-7b-5p 8mer-1a     -0.051              12 NM_001198820 NM_014576
# 3 A1CF hsa-let-7c-5p 8mer-1a     -0.051              12 NM_001198820 NM_014576
# 4 A1CF hsa-let-7d-5p 8mer-1a     -0.062              12 NM_001198820 NM_014576
# 5 A1CF hsa-let-7e-5p 8mer-1a     -0.051              12 NM_001198820 NM_014576
# 6 A1CF hsa-let-7f-5p 8mer-1a     -0.051              12 NM_001198820 NM_014576
#        new3         new4         new5      new6
# 1 NM_138932 NM_001198819 NM_001198818 NM_138933
# 2 NM_138932 NM_001198819 NM_001198818 NM_138933
# 3 NM_138932 NM_001198819 NM_001198818 NM_138933
# 4 NM_138932 NM_001198819 NM_001198818 NM_138933
# 5 NM_138932 NM_001198819 NM_001198818 NM_138933
# 6 NM_138932 NM_001198819 NM_001198818 NM_138933
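
If tidyr is not available, the same look-behind split can be done in base R. This is only a minimal sketch: the small TargetScan data frame below is a made-up stand-in for the real one, the names parts, mat and result are arbitrary, and the code assumes every row contains the same number of IDs.

```r
# Toy stand-in for the real TargetScan data (hypothetical, two rows only)
TargetScan <- data.frame(
  Gene   = c("A1CF", "A1CF"),
  RefSeq = c("NM_001198820_NM_014576_NM_138932",
             "NM_001198820_NM_014576_NM_138932"),
  stringsAsFactors = FALSE
)

# Split only at underscores that are preceded by a digit
parts <- strsplit(as.character(TargetScan$RefSeq), "(?<=\\d)_", perl = TRUE)

# Stack the pieces row-wise; assumes each row has the same number of IDs
mat <- do.call(rbind, parts)
colnames(mat) <- paste0("new", seq_len(ncol(mat)))

# Drop the original RefSeq column and append the new columns
result <- cbind(TargetScan[setdiff(names(TargetScan), "RefSeq")], mat,
                stringsAsFactors = FALSE)
```

If rows can have different numbers of IDs, the rbind step would need padding with NA first.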

Upvotes: 3

Wilson Freitas

Reputation: 531

Use a regular expression to match the text you want; I suggest stringr::str_match_all for this.

library(stringr)
s <- c('NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933',
       'NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933')
str_match_all(s, '([A-Za-z]{2}_\\d+)_?')

yields

[[1]]
     [,1]            [,2]          
[1,] "NM_001198820_" "NM_001198820"
[2,] "NM_014576_"    "NM_014576"   
[3,] "NM_138932_"    "NM_138932"   
[4,] "NM_001198819_" "NM_001198819"
[5,] "NM_001198818_" "NM_001198818"
[6,] "NM_138933"     "NM_138933"   

[[2]]
     [,1]            [,2]          
[1,] "NM_001198820_" "NM_001198820"
[2,] "NM_014576_"    "NM_014576"   
[3,] "NM_138932_"    "NM_138932"   
[4,] "NM_001198819_" "NM_001198819"
[5,] "NM_001198818_" "NM_001198818"
[6,] "NM_138933"     "NM_138933"   

After that, you can organize the data in the returned list into a data.frame. Note that the second column of each matrix holds the information you want.
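
One possible way to do that reorganizing, as a rough sketch rather than the only option: take the second column of each matrix and bind the results row-wise. The names m, ids and df are arbitrary, the new1..new6 labels are borrowed from the question, and the code assumes every string yields the same number of matches.

```r
library(stringr)

s <- c('NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933',
       'NM_001198820_NM_014576_NM_138932_NM_001198819_NM_001198818_NM_138933')

# One matrix per input string: column 1 is the full match, column 2 the capture group
m <- str_match_all(s, '([A-Za-z]{2}_\\d+)_?')

# Keep only the capture group (second column) from each matrix
ids <- lapply(m, function(x) x[, 2])

# One row per input string; assumes equal numbers of IDs per string
df <- as.data.frame(do.call(rbind, ids), stringsAsFactors = FALSE)
names(df) <- paste0("new", seq_along(df))
```

The resulting df can then be attached to the original data with cbind.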

Upvotes: 0

AKJ88

Reputation: 733

I would replace the underscore before each NM with a different delimiter using gsub, and then call strsplit on the result, something like this:

strsplit(gsub('_NM', ',NM', s), ',')
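
A self-contained version of this idea, as a small sketch: s is a shortened example string, and the names via_gsub and via_lookahead are made up for illustration. The comma trick assumes the data contains no commas. The second line shows a look-ahead variant that avoids the temporary delimiter; it is an alternative, not part of the approach above.

```r
s <- "NM_001198820_NM_014576_NM_138932"

# Swap the underscore before each NM for a comma, then split on the comma
via_gsub <- strsplit(gsub("_NM", ",NM", s, fixed = TRUE), ",", fixed = TRUE)[[1]]

# Alternative: split directly at underscores that are followed by NM
via_lookahead <- strsplit(s, "_(?=NM)", perl = TRUE)[[1]]

via_gsub
# [1] "NM_001198820" "NM_014576"    "NM_138932"
```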

Upvotes: 0
