Agaz Wani

Reputation: 5684

ngrams not in correct order

I want to find the ngrams of a string x = "A T G C C G C G T". I use the ngram R package to get them, with the following lines:

library(ngram)
x <- "A T G C C G C G T"
ng <- ngram(x, n = 2)
ngrams_out <- get.ngrams(ng)
ngrams_final <- gsub(" ", "", ngrams_out, fixed = TRUE)
# "CG" "TG" "AT" "GC" "CC" "GT" ## ngrams

This gives all the ngrams of the string without repetition, but I am surprised that they are not in the correct order. The order is very important for tracing the position of an ngram. The correct order of the ngrams is "AT","TG","GC","CC","CG","GC","CG","GT", with repetition, from which I can clearly make out the position of a particular ngram in the given string.

Upvotes: 1

Views: 112

Answers (3)

Ken Benoit

Reputation: 14902

The text analysis package quanteda has a great ngram generator:

require(quanteda)
unlist(tokenize("A T G C C G C G T", ngrams = 2, concatenator = ""))
## [1] "AT" "TG" "GC" "CC" "CG" "GC" "CG" "GT"

Here I have converted the tokenizedText class object (a type of list) returned from tokenize() into the simple vector you want.

Upvotes: 1

akrun

Reputation: 887153

We can scan the string 'x' to get the individual characters, and then paste the adjacent elements together.

 v1 <- scan(text=x, what='')
 paste0(v1[-length(v1)], v1[-1])
 #[1] "AT" "TG" "GC" "CC" "CG" "GC" "CG" "GT"
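
The same pasting idea can be extended to any n, and the ordered result can then be searched for positions. The following is only a sketch in base R: the helper name ngrams_in_order is hypothetical (it is not part of any package), and it assumes the tokens in x are separated by whitespace.

```r
# Hypothetical helper: ordered n-grams (with repeats) from a
# whitespace-separated string. Not part of the ngram package.
ngrams_in_order <- function(x, n) {
  tokens <- strsplit(trimws(x), "\\s+")[[1]]
  # Fewer tokens than n: there is no complete n-gram to return.
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  # Paste each window of n adjacent tokens, left to right.
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = ""),
         character(1))
}

x <- "A T G C C G C G T"
bigrams <- ngrams_in_order(x, 2)
trigrams <- ngrams_in_order(x, 3)

# Because the order is preserved, the index of an n-gram is its
# starting position in the token sequence.
gc_positions <- which(bigrams == "GC")
```

Here bigrams should match the output above, and gc_positions holds the start positions of "GC" in the sequence.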

For the updated question,

 x1 <- gsub('\\s+', '', x)
 n <- 3
 pat <- paste0('.{', n,'}')
 library(stringi)
 v1 <- c(stri_list2matrix(lapply(seq_len(n), function(i) 
    stri_extract_all_regex(substring(x1,i), pat)[[1]]),byrow=TRUE))
 v1[!is.na(v1)]
 #[1] "ATG" "TGC" "GCC" "CCG" "CGC" "GCG" "CGT"

Changing n to 4 and rerunning the lines above that build pat and v1 gives

 n <- 4
 v1[!is.na(v1)]
 #[1] "ATGC" "TGCC" "GCCG" "CCGC" "CGCG" "GCGT"
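
When every token is a single character, as it is here, a sliding window can also be sketched without stringi by using base R's vectorised substring(). This is an alternative technique, not the stringi approach above, and it assumes that nchar(x1) is at least n.

```r
x <- "A T G C C G C G T"
x1 <- gsub("\\s+", "", x)   # "ATGCCGCGT"
n <- 4

# One start index per window; assumes nchar(x1) >= n.
starts <- seq_len(nchar(x1) - n + 1)

# substring() is vectorised over its start and stop arguments,
# so this extracts every window of width n in order.
v2 <- substring(x1, starts, starts + n - 1)
```

The element at index i of v2 is the n-gram that starts at character i of x1, so positions can be read off directly.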

Upvotes: 3

Avinash Raj

Reputation: 174706

I don't know about the ngram package, but you can produce the output like this:

x <- "A T G C C G C G T"
strsplit(gsub("(\\S)(?=\\s(\\S))|\\s+\\S$", "\\1\\2", x, perl=TRUE), " ")[[1]]
# [1] "AT" "TG" "GC" "CC" "CG" "GC" "CG" "GT"
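
To see how the regex works, it may help to split the one-liner into its two steps. This is a sketch of my reading of the pattern: the first alternative captures a character and, through the lookahead, the character after the following space, so each character is replaced by itself plus its right neighbour; the second alternative removes the dangling final token. The variable names joined and bigrams are only illustrative.

```r
x <- "A T G C C G C G T"

# Step 1: rewrite each token as token + next token; the lookahead
# does not consume the next token, so it is matched again in turn.
# The second alternative deletes the final space and token.
joined <- gsub("(\\S)(?=\\s(\\S))|\\s+\\S$", "\\1\\2", x, perl = TRUE)

# Step 2: split the rewritten string on single spaces.
bigrams <- strsplit(joined, " ")[[1]]
```

Note that the pattern relies on single-character tokens separated by single spaces, and it is specific to n = 2.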

Upvotes: 3
