Reputation: 854

Generating substrings and random strings in R

Please bear with me, I come from a Python background and I am still learning string manipulation in R.

OK, so let's say I have a string of length 100 made up of random A, B, C, or D letters:

> df<-c("ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD")
> df
[1] "ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD"

I would like to do the following two things:

1) Generate a '.txt' file made up of 20-character substrings of the above string, each starting one letter after the previous one, with its own unique name on the line above it, like this:

NAME1
ABCBDBDBCBABABDBCBCB
NAME2
BCBDBDBCBABABDBCBCBD
NAME3
CBDBDBCBABABDBCBCBDB
NAME4
BDBDBCBABABDBCBCBDBD

... and so forth

2) Take that generated list and build another list from it that has exactly the same substrings, the only difference being that one or two of the A, B, C, or Ds are changed to another A, B, C, or D (any of those four letters only).

So, this:

NAME1
ABCBDBDBCBABABDBCBCB

Would become this:

NAME1.1
ABBBDBDBCBDBABDBCBCB

As you can see, the "C" in the third position became a "B" and the "A" in position 11 became a "D", with no implied relationship between those changed letters. Purely random.

I know this is a convoluted question, but like I said, I am still learning basic text and string manipulation in R.

Thanks in advance.

Upvotes: 2

Answers (4)

nograpes

Reputation: 18323

I tried breaking this down into multiple simple steps; hopefully you can learn a few tricks from this:

# Random data
df<-c("ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD")
n<-10 # Number of cuts
set.seed(1)
# Pick n random start positions between 1 and the last valid start, nchar(df)-20+1
nums<-sample(1:(nchar(df)-20+1),n,replace=TRUE)
# Make your cuts
cuts<-sapply(nums,function(x) substring(df,x,x+20-1))
# Generate some names
nams<-paste0('NAME',1:n)
# Make it into a matrix, transpose, and then recast into a vector to get alternating names and cuts.
names.and.cuts<-c(t(matrix(c(nams,cuts),ncol=2)))
# Drop a file.
write.table(names.and.cuts,'file.txt',quote=FALSE,row.names=FALSE,col.names = FALSE)

# Pick how many changes are going to be made to each cut.
changes<-sample(1:2,n,replace=TRUE)
# Pick that number of positions to change
pos.changes<-lapply(changes,function(x) sample(1:20,x))
# Find the letter at each position of the corresponding cut (not of the full string).
letter.at.change.pos<-mapply(function(cut,pos) substring(cut,pos,pos),cuts,pos.changes,SIMPLIFY=FALSE,USE.NAMES=FALSE)
# Make a function that takes any letter, and outputs any other letter from c(A-D)
letter.map<-function(x){
    # Make a list of alternate letters.
    alternates<-lapply(x,setdiff,x=c('A','B','C','D'))
    # Pick one of each
    sapply(alternates,sample,size=1)
}
# Find another letter for each
letter.changes<-lapply(letter.at.change.pos,letter.map)
# Make a function to replace character by position
# Inefficient, but who cares.
rep.by.char<-function(str,pos,chars){
  # seq_along() is safe even if pos is empty
  for (i in seq_along(pos)) substr(str,pos[i],pos[i])<-chars[i]
  str
}

# Change every letter at pos.changes to letter.changes
mod.cuts<-mapply(rep.by.char,cuts,pos.changes,letter.changes,USE.NAMES=FALSE)
# Generate names
nams<-paste0(nams,'.1')
# Use the matrix trick to alternate names. Drop a file.
names.and.mod.cuts<-c(t(matrix(c(nams,mod.cuts),ncol=2)))
write.table(names.and.mod.cuts,'file2.txt',quote=FALSE,row.names=FALSE,col.names = FALSE)

Also, instead of the rep.by.char function, you could just use strsplit and replace like this:

mod.cuts<-mapply(function(x,y,z) paste(replace(x,y,z),collapse=''),
   strsplit(cuts,''),pos.changes,letter.changes,USE.NAMES=FALSE)
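
Note that the cuts above start at random positions. If the windows should instead start one letter after the previous one, as in the question, a minimal sketch could look like the following. Here `s` is a shortened, hypothetical stand-in for `df`, and `width`, `starts` and `windows` are names I made up for illustration. `substring()` is vectorised over its start and stop arguments, so no loop is needed.

```r
# Assumption: `s` is a shortened stand-in for the 100-character string `df`.
s <- "ABCBDBDBCBABABDBCBCBDBDBCB"
width <- 20

# One start position per window; the last window starts at nchar(s) - width + 1.
starts <- seq_len(nchar(s) - width + 1)

# substring() recycles `s` and is vectorised over first/last.
windows <- substring(s, starts, starts + width - 1)
windows[1:2]
```

The rest of the steps (naming, writing the file, making the changes) should work unchanged if `cuts` is replaced by a vector built this way and `n` is set to its length.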

Upvotes: 4

Jota

Reputation: 17611

For the first part of your question:

df <- c("ABCBDBDBCBABABDBCBCBDBDBCBDBACDBCCADCDBCDACDDCDACBCDACABACDACABBBCCCBDBDDCACDDACADDDDACCADACBCBDCACD")

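nstrchars <- 20
# number of start positions: the last 20-character window starts at nchar(df)-nstrchars+1
count <- nchar(df)-nstrchars+1

# x+nstrchars-1 (not x+20) so that each substring has exactly 20 characters
length20substrings <- data.frame(length20substrings=sapply(1:count,function(x)substr(df,x,x+nstrchars-1)))

# to save to a text file.  I chose not to include row names, a column name, or quotes in the .txt file
write.table(length20substrings,"length20substrings.txt",quote=FALSE,row.names=FALSE,col.names=FALSE)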
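
To get the alternating NAME / substring layout the question asks for, one option is to interleave two vectors with `rbind()` and write them with `writeLines()`. This is only a sketch: `subs` holds short made-up stand-ins for the real 20-character substrings, and the output goes to a temporary file rather than a fixed path.

```r
# Assumption: `subs` is a hypothetical stand-in for the vector of substrings.
subs <- c("ABCB", "BCBD", "CBDB")
nams <- paste0("NAME", seq_along(subs))

# rbind() stacks names over substrings; c() then reads the matrix column by
# column, which yields NAME1, substring 1, NAME2, substring 2, ...
lines <- c(rbind(nams, subs))

out <- tempfile(fileext = ".txt")
writeLines(lines, out)
readLines(out)
```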

For the second part:

# create a function that will randomly pick one or two spots in a string and replace
# those spots with one of the other characters present in the string:

changefxn<- function(x){
 x<-as.character(x)
 splitstr<-strsplit(x,"")[[1]]
 charspresent<-unique(splitstr)
 numchanges<-sample(1:2,1)
 # positions to change, drawn from the whole string
 ids<-sample(seq_along(splitstr),numchanges)
 for (i in ids) {
  # replace with a different character that occurs somewhere in the string
  splitstr[i] <- sample(setdiff(charspresent,splitstr[i]),1)
 }
 newstr<-paste(splitstr,collapse="")
 return(newstr)
}

# try it out

changefxn("asbbad")
changefxn("12lkjaf38gs")

# apply changefxn to all the substrings from part 1

# pull the single column out of the data frame as a character vector
length20substrings<-as.character(length20substrings[,1])
newstrings <- lapply(length20substrings, changefxn)
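
As a hedged alternative sketch, the helper below draws replacements from a fixed A-D alphabet rather than from the characters present in the string, and always picks a letter different from the one being replaced, so each call changes exactly one or two positions. The name `mutate_string` and its arguments are my own, not from any package.

```r
# Hypothetical helper: change 1..max_changes positions of `s` to a different
# letter from `alphabet`.
mutate_string <- function(s, alphabet = c("A", "B", "C", "D"), max_changes = 2) {
  chars <- strsplit(s, "")[[1]]
  k <- sample(seq_len(max_changes), 1)   # how many positions to change
  pos <- sample(seq_along(chars), k)     # which positions, without repeats
  for (p in pos) {
    others <- setdiff(alphabet, chars[p])
    # index-based draw, so a single remaining candidate is handled correctly
    chars[p] <- others[sample(length(others), 1)]
  }
  paste(chars, collapse = "")
}

set.seed(42)
original <- "ABCBDBDBCBABABDBCBCB"
mutated <- mutate_string(original)

# count the positions that differ between the two strings
n_diff <- sum(strsplit(original, "")[[1]] != strsplit(mutated, "")[[1]])
```

Applying it to a whole vector of substrings would be something like `vapply(subs, mutate_string, character(1), USE.NAMES = FALSE)`, where `subs` is whatever vector holds the substrings.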

Upvotes: 2

Carl Witthoft

Reputation: 21502

One way, albeit slowish:

Rgames> foo<-paste(sample(c('a','b','c','d'),20,rep=T),sep='',collapse='')
Rgames> bar<-matrix(unlist(strsplit(foo,'')),ncol=5)
Rgames> bar
     [,1] [,2] [,3] [,4] [,5]
[1,] "c"  "c"  "a"  "c"  "a" 
[2,] "c"  "c"  "b"  "a"  "b" 
[3,] "b"  "b"  "a"  "c"  "d" 
[4,] "c"  "b"  "a"  "c"  "c"

Now you can select random indices and replace the selected locations with sample(c('a','b','c','d'),1). For "true" randomness, I wouldn't even force a change: if your newly drawn letter is the same as the original, so be it. Like this:

ibar<-sample(1:5,4,replace=TRUE) # one random column number for each row
for (j in 1:4) bar[j,ibar[j]]<-sample(c('a','b','c','d'),1)

Then, if necessary, recombine each row using paste.
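
For what it's worth, here is a rough, self-contained sketch of the same idea without the loop, including the recombination step. It assumes a 4 x 5 character matrix like `bar` above; `letters4` and `rows` are names I made up. Indexing with a two-column matrix built by `cbind()` picks one (row, column) cell per row.

```r
set.seed(1)
letters4 <- c("a", "b", "c", "d")

# 20 random letters arranged as 4 rows of 5 characters each
bar <- matrix(sample(letters4, 20, replace = TRUE), ncol = 5)

# one random column per row, then overwrite those cells in a single step;
# the new letter may equal the old one, as in the loop version
ibar <- sample(1:5, nrow(bar), replace = TRUE)
bar[cbind(seq_len(nrow(bar)), ibar)] <- sample(letters4, nrow(bar), replace = TRUE)

# paste each row back into a single string
rows <- apply(bar, 1, paste, collapse = "")
rows
```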

Upvotes: 2

Sven Hohenstein

Reputation: 81693

Create a text file of substrings:

n <- 20 # length of substrings

# df is the string from the question; one start position per substring
starts <- seq(nchar(df) - n + 1)

v1 <- mapply(substr, starts, starts + n - 1, MoreArgs = list(x = df))

names(v1) <- paste0("NAME", seq_along(v1), "\n")

write.table(v1, file = "filename.txt", quote = FALSE, sep = "",
            col.names = FALSE)

Randomly replace one or two letters (A-D):

myfun <- function() {
  idx <- sample(seq(n), sample(1:2, 1))
  rep <- sample(LETTERS[1:4], length(idx), replace = TRUE)
  return(list(idx = idx, rep = rep))
}

new <- replicate(length(v1), myfun(), simplify = FALSE)

v2 <- mapply(function(x, y, z) paste(replace(x, y, z), collapse = ""),  
             strsplit(v1, ""),
             lapply(new, "[[", "idx"),
             lapply(new, "[[", "rep"))

# names(v2) is inherited from v1 and ends in "\n", so build the new names from scratch
names(v2) <- paste0("NAME", seq_along(v2), ".1")

write.table(v2, file = "filename2.txt", quote = FALSE, sep = "\n", 
            col.names = FALSE)
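
One caveat: because myfun() samples the replacement from all four letters, a replacement can equal the original letter, so some strings may end up with fewer real changes than positions drawn. A small sketch for checking how many positions actually differ follows; `hamming`, `before` and `after` are hypothetical names, and the example strings are taken from the question.

```r
# Hypothetical helper: number of differing positions for each pair of
# equal-length strings.
hamming <- function(a, b) {
  mapply(function(x, y) sum(x != y),
         strsplit(a, ""), strsplit(b, ""),
         USE.NAMES = FALSE)
}

before <- c("ABCBDBDBCBABABDBCBCB", "BCBDBDBCBABABDBCBCBD")
after  <- c("ABBBDBDBCBDBABDBCBCB", "BCBDBDBCBABABDBCBCBD")

hamming(before, after)
# [1] 2 0
```

With the objects above, `hamming(v1, v2)` should give the per-string count, and any zeros would mark strings that came out unchanged.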

Upvotes: 4
