Reputation: 2385

R implementation for Finding the longest common starting substrings in a set of strings

This question is just asking for an implementation in R of the following question : Find the longest common starting substring in a set of strings (JavaScript)

"This problem is a more specific case of the Longest common substring problem. I need to only find the longest common starting substring in an array".

So im just looking an R implementation for this question (preferably not in a for / while loop fashion that was suggested in the JavaScript version), if possible i would like to wrap it up as a function, so i could apply on many groups in a data table.

After some searches, i couldn't find an R example for this, hence this question.

Example Data: I have the following vector of characters:

dput(data)
c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", 
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")

I'm looking to run an algorithm in R that will find the following output: ADA44.

From what I've seen in the JavaScript accepted answer, the idea is to first sort the vector, extract the first and last elements (for example : "ADA4417-3ARMZ-R7" and "ADA4432-1BRJZ-R2" , break them into single characters, and loop through them until one of the characters don't match (hope im right)

Any Help on that would be great!

Upvotes: 7

Answers (4)

Catherine Smith

Reputation: 175

Here is a compact solution:

data<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", "ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")

substr(data[1],1,which.max(apply(do.call(rbind,lapply(strsplit(data,''),`length<-`,nchar(data[1]))),2,function(i)!length(unique(i))==1))-1)
[1] "ADA44"

Upvotes: 1

russellpierce

Reputation: 4711

Piggybacking off Henrik's answer, Bioconductor has a C based prefix function and an R based one. The R based one is:

lcPrefix <- function (x, ignore.case = FALSE) 
{
    x <- as.character(x)
    if (ignore.case) 
        x <- toupper(x)
    nc <- nchar(x, type = "char")
    for (i in 1:min(nc)) {
        ss <- substr(x, 1, i)
        if (any(ss != ss[1])) {
            return(substr(x[1], 1, i - 1))
        }
    }
    substr(x[1], 1, i)
}
<environment: namespace:Biobase>

... and doesn't require any special features of Bioconductor (as far as I can tell).

--- Citation ---

Orchestrating high-throughput genomic analysis with Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods,
2015:12, 115.

Upvotes: 3

Henrik

Reputation: 67788

A non-base alternative, using the lcprefix function in Biostrings to find the "Longest Common Prefix [...] of two strings"

source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)

x2 <- sort(x)
substr(x2[1], start = 1, stop = lcprefix(x2[1], x2[length(x2)]))
# [1] "ADA44"

Upvotes: 5

Cath

Reputation: 24074

Taking inspiration from what you suggested, you can try this function :

comsub<-function(x) {
    # sort the vector
    x<-sort(x)
    # split the first and last element by character
    d_x<-strsplit(x[c(1,length(x))],"")
    # compute the cumulative sum of common elements
    cs_x<-cumsum(d_x[[1]]==d_x[[2]])
    # check if there is at least one common element
    if(cs_x[1]!=0) {
        # see when it stops incrementing and get the position of last common element
        der_com<-which(diff(cs_x)==0)[1]
        # return the common part
        return(substr(x[1],1,der_com))
    } else { # else, return an empty vector
        return(character(0))
    }
}

UPDATE

Following @nicola suggestion, a simpler and more elegant variant for the function:

comsub<-function(x) {
    # sort the vector
    x<-sort(x)
    # split the first and last element by character
    d_x<-strsplit(x[c(1,length(x))],"")
    # search for the first not common element and so, get the last matching one
    der_com<-match(FALSE,do.call("==",d_x))-1
    # if there is no matching element, return an empty vector, else return the common part
    ifelse(der_com==0,return(character(0)),return(substr(x[1],1,der_com)))
}

Examples:

With your data

x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", 
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"

When there is no common starting substring

x<-c("abc","def")
> comsub(x)
# character(0)

Upvotes: 13

R implementation for Finding the longest common starting substrings in a set of strings

Answers (4)

Related Questions