Sebastian Zeki

Reputation: 6874

Split a string in a dataframe column and return a new column with the split

I have a dataframe called dat which has two columns, as below:

col1   col2
chr2   atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg
chr3   atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg

I want to split the string at a match for gtggctc and return a new column containing the match plus a specified number of following characters (e.g. 10 further characters), as follows:

col1   col2                      new_split_col
chr2   atagaaaaatcggctgggtgcg    gtggctcactcctataa
chr3   atagaaaaatcggctgggtgcg    gtggctcactcctataa

I have tried

library(stringr)
dat$new_split_col <- str_split(dat$col2, "gtggctc", 2)

but it puts both pieces in one column and doesn't include the match itself. It also doesn't let me specify the length of the text kept after the match.

Upvotes: 1

Views: 111

Answers (1)

akrun

Reputation: 887008

Try

library(stringr)
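# Split in front of the motif (zero-width lookahead), then keep the motif
# plus the next 10 characters (17 in total). perl() was deprecated in
# stringr 1.0.0 in favour of regex(), which is used here.
dat[c('col2', 'new_split_col')] <-  do.call(rbind,lapply(str_split(dat$col2,
     regex('(?=gtggctc)'), 2), function(x) c(x[1],substr(x[2],1,17))))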

Or

library(tidyr)
extract(dat, col2, into=c('col2', 'new_split_col'), '(.*)(gtggctc.{10}).*')
#  col1                   col2     new_split_col
#1 chr2 atagaaaaatcggctgggtgcg gtggctcactcctataa
#2 chr3 atagaaaaatcggctgggtgcg gtggctcactcctataa
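
The motif and the number of trailing characters are hard-coded in the pattern above. Here is a minimal sketch of building the same pattern from variables, using only base R. The names `motif`, `n_extra`, and `pattern` are illustrative and not part of the original answer, and the motif is assumed to contain no regex metacharacters:

```r
# Example data from the question
dat <- data.frame(
  col1 = c("chr2", "chr3"),
  col2 = rep("atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg", 2),
  stringsAsFactors = FALSE
)

motif   <- "gtggctc"  # hypothetical variable; assumed free of regex metacharacters
n_extra <- 10         # hypothetical variable; characters to keep after the motif

# Same shape as the pattern above: prefix, then motif plus n_extra characters
pattern <- sprintf("^(.*)(%s.{%d}).*$", motif, n_extra)

# Compute the new column first, because col2 is overwritten on the next line
dat$new_split_col <- sub(pattern, "\\2", dat$col2)
dat$col2          <- sub(pattern, "\\1", dat$col2)
```

Rows where the pattern does not match are returned unchanged by `sub`, so both columns would hold the full string in that case. Because `(.*)` is greedy, a sequence containing the motif more than once would be split at the last occurrence that still has `n_extra` characters after it.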

Or

dat[c('col2', 'new_split_col')] <- read.table(text=gsub('(.*)(gtggctc.{10}).*',
         '\\1 \\2', dat$col2))
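
If a regex is not needed, the same result can probably be reached by locating the motif and cutting by position. This is a sketch in base R rather than part of the original answer; `motif`, `n_extra`, `pos`, `found`, and `full` are illustrative names, and rows without the motif are assumed to get `NA` in the new column while keeping `col2` intact:

```r
# Example data from the question
dat <- data.frame(
  col1 = c("chr2", "chr3"),
  col2 = rep("atagaaaaatcggctgggtgcggtggctcactcctataatcccagcactttg", 2),
  stringsAsFactors = FALSE
)

motif   <- "gtggctc"  # hypothetical variable; matched literally (fixed = TRUE)
n_extra <- 10         # hypothetical variable; characters to keep after the motif

# Start position of the first occurrence of the motif (-1 when absent)
pos   <- as.integer(regexpr(motif, dat$col2, fixed = TRUE))
found <- pos > 0

# Keep the full sequences aside, since col2 is overwritten below
full <- dat$col2

# Motif plus the n_extra characters that follow it, NA when there is no match
dat$new_split_col <- ifelse(found,
                            substr(full, pos, pos + nchar(motif) + n_extra - 1),
                            NA_character_)

# Everything before the motif; unmatched rows keep the original string
dat$col2 <- ifelse(found, substr(full, 1, pos - 1), full)
```

Unlike the greedy regex versions, this splits at the first occurrence of the motif, and `substr` simply returns fewer characters when fewer than `n_extra` remain after it.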

Upvotes: 2
