Reputation: 5684
I am interested in finding the ngrams of a string x = "A T G C C G C G T". I use the ngram R package to get the ngrams, with the following lines:
library(ngram)
ng <- ngram(x,n=2)
ngrams_out = get.ngrams(ng)
ngrams_final <- gsub(" ", "",ngrams_out , fixed = TRUE)
# "CG" "TG" "AT" "GC" "CC" "GT" ## ngrams
This gives all the ngrams of the string without repetition, but I am surprised that the ngrams are not in the correct order. The order is very important for tracing the position of an ngram. The correct order of the ngrams is "AT","TG","GC","CC","CG","GC","CG","GT", with repetition, from which I can clearly make out the position of a particular ngram in the given string.
Upvotes: 1
Views: 112
Reputation: 14902
The text analysis package quanteda has a great ngram generator:
require(quanteda)
unlist(tokenize("A T G C C G C G T", ngrams = 2, concatenator = ""))
## [1] "AT" "TG" "GC" "CC" "CG" "GC" "CG" "GT"
Here I have converted the tokenizedText class object (a type of list) returned from tokenize() into the simple vector you want.
Upvotes: 1
Reputation: 887153
We can scan the string 'x' to get the individual characters, and then paste the adjacent elements together.
v1 <- scan(text=x, what='')
paste0(v1[-length(v1)], v1[-1])
#[1] "AT" "TG" "GC" "CC" "CG" "GC" "CG" "GT"
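Because this vector keeps the bigrams in string order, with repeats, the position of a particular bigram can be read off with which(). A minimal sketch, assuming x is the string from the question; bigrams and pos_GC are illustrative names only:

```r
x <- "A T G C C G C G T"

# Split on whitespace into single characters, then pair each with its neighbour
v1 <- scan(text = x, what = "", quiet = TRUE)
bigrams <- paste0(v1[-length(v1)], v1[-1])

# Indices (1-based) at which the bigram "GC" starts
pos_GC <- which(bigrams == "GC")
pos_GC
#[1] 3 6
```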
For the updated question,
x1 <- gsub('\\s+', '', x)
n <- 3
pat <- paste0('.{', n,'}')
library(stringi)
v1 <- c(stri_list2matrix(lapply(seq_len(n), function(i)
stri_extract_all_regex(substring(x1,i), pat)[[1]]),byrow=TRUE))
v1[!is.na(v1)]
#[1] "ATG" "TGC" "GCC" "CCG" "CGC" "GCG" "CGT"
Changing n to 4 and re-running the steps above (so that pat and v1 are recomputed) gives
n <- 4
v1[!is.na(v1)]
#[1] "ATGC" "TGCC" "GCCG" "CCGC" "CGCG" "GCGT"
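The same sliding window can also be sketched in base R with the vectorised substring(), without any regex. This is only an alternative sketch, assuming every token is a single character; char_ngrams is a hypothetical helper name, not a function from any package:

```r
# Hypothetical helper: ordered character n-grams of a space-separated string
char_ngrams <- function(x, n) {
  s <- gsub("\\s+", "", x)            # drop the separating whitespace
  len <- nchar(s)
  if (n < 1 || n > len) return(character(0))
  starts <- seq_len(len - n + 1)      # start index of every window
  substring(s, starts, starts + n - 1)
}

x <- "A T G C C G C G T"
char_ngrams(x, 3)
#[1] "ATG" "TGC" "GCC" "CCG" "CGC" "GCG" "CGT"
char_ngrams(x, 4)
#[1] "ATGC" "TGCC" "GCCG" "CCGC" "CGCG" "GCGT"
```

Since the result is in string order, element i of the output is the n-gram that starts at character i.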
Upvotes: 3
Reputation: 174706
I don't know about the ngram package, but you can produce the output like this:
x= "A T G C C G C G T"
strsplit(gsub("(\\S)(?=\\s(\\S))|\\s+\\S$", "\\1\\2", x, perl=TRUE), " ")[[1]]
# [1] "AT" "TG" "GC" "CC" "CG" "GC" "CG" "GT"
Upvotes: 3