Reputation: 6874
I have a data frame called dat with two columns, as shown below:
col1 col2
chr2 atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg
chr3 atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg
I want to split the string at a match for gtggctc and return a new column that contains the match itself plus a specified number of further characters (e.g. 10), as follows:
col1 col2 new_split_col
chr2 atagaaaaatcggctgggtgcg gtggctcactcctataa
chr3 atagaaaaatcggctgggtgcg gtggctcactcctataa
I have tried
library(stringr)
dat$new_split_col <- str_split(dat$col2, "gtggctc", 2)
but it puts both pieces into one column and doesn't include the match itself. It also doesn't let me specify the length of the desired match.
Upvotes: 1
Views: 111
Reputation: 887008
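Try splitting on a zero-width lookahead, which keeps the match at the start of the second piece
library(stringr)
# regex() replaces the older perl() modifier, which was deprecated in stringr 1.0
# 17 = 7 (the length of gtggctc) + 10 further characters
dat[c('col2', 'new_split_col')] <- do.call(rbind, lapply(str_split(dat$col2,
    regex('(?=gtggctc)'), 2), function(x) c(x[1], substr(x[2], 1, 17))))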
Or
library(tidyr)
extract(dat, col2, into=c('col2', 'new_split_col'), '(.*)(gtggctc.{10}).*')
# col1 col2 new_split_col
#1 chr2 atagaaaaatcggctgggtgcg gtggctcactcctataa
#2 chr3 atagaaaaatcggctgggtgcg gtggctcactcctataa
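Or, using only base R
# stringsAsFactors = FALSE keeps the new columns as character in R < 4.0
dat[c('col2', 'new_split_col')] <- read.table(text=gsub('(.*)(gtggctc.{10}).*',
    '\\1 \\2', dat$col2), stringsAsFactors = FALSE)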
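If you would rather not hard-code the length, a base R sketch along these lines might help. Note that split_at_motif and n_extra are names made up for this illustration, and the sketch assumes the motif is a literal string (not a regex) and that only its first occurrence matters.

```r
# Hypothetical helper: split each string at the first occurrence of `motif`
# and keep the motif plus up to `n_extra` further characters.
split_at_motif <- function(x, motif, n_extra = 10) {
  # Start position of the first literal match, -1 where there is none;
  # as.integer() drops the extra attributes that regexpr() attaches.
  pos <- as.integer(regexpr(motif, x, fixed = TRUE))
  hit <- pos > 0
  # Text before the match (the whole string if there is no match)
  before <- ifelse(hit, substr(x, 1, pos - 1), x)
  # The match plus n_extra characters (NA if there is no match)
  after <- ifelse(hit,
                  substr(x, pos, pos + nchar(motif) + n_extra - 1),
                  NA_character_)
  data.frame(col2 = before, new_split_col = after, stringsAsFactors = FALSE)
}

# Example data rebuilt from the question
dat <- data.frame(
  col1 = c("chr2", "chr3"),
  col2 = rep("atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg", 2),
  stringsAsFactors = FALSE
)

dat[c("col2", "new_split_col")] <- split_at_motif(dat$col2, "gtggctc", n_extra = 10)
```

Changing n_extra changes how many characters are kept after the motif, and rows without the motif keep their original col2 and get NA in new_split_col.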
Upvotes: 2