Reputation: 2385
This question is just asking for an implementation in R of the following question : Find the longest common starting substring in a set of strings (JavaScript)
"This problem is a more specific case of the Longest common substring problem. I need to only find the longest common starting substring in an array".
So im just looking an R implementation for this question (preferably not in a for / while loop fashion that was suggested in the JavaScript version), if possible i would like to wrap it up as a function, so i could apply on many groups in a data table.
After some searches, i couldn't find an R example for this, hence this question.
Example Data: I have the following vector of characters:
dput(data)
c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
I'm looking to run an algorithm in R that will find the following output: ADA44
.
From what I've seen in the JavaScript accepted answer, the idea is to first sort the vector, extract the first and last elements (for example : "ADA4417-3ARMZ-R7"
and "ADA4432-1BRJZ-R2"
, break them into single characters, and loop through them until one of the characters don't match (hope im right)
Any Help on that would be great!
Upvotes: 7
Views: 3582
Reputation: 175
Here is a compact solution:
data<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", "ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
substr(data[1],1,which.max(apply(do.call(rbind,lapply(strsplit(data,''),`length<-`,nchar(data[1]))),2,function(i)!length(unique(i))==1))-1)
[1] "ADA44"
Upvotes: 1
Reputation: 4711
Piggybacking off Henrik's answer, Bioconductor has a C based prefix function and an R based one. The R based one is:
lcPrefix <- function (x, ignore.case = FALSE)
{
x <- as.character(x)
if (ignore.case)
x <- toupper(x)
nc <- nchar(x, type = "char")
for (i in 1:min(nc)) {
ss <- substr(x, 1, i)
if (any(ss != ss[1])) {
return(substr(x[1], 1, i - 1))
}
}
substr(x[1], 1, i)
}
<environment: namespace:Biobase>
... and doesn't require any special features of Bioconductor (as far as I can tell).
--- Citation ---
Orchestrating high-throughput genomic analysis with Bioconductor. W. Huber, V.J. Carey, R. Gentleman, ..., M. Morgan Nature Methods,
2015:12, 115.
Upvotes: 3
Reputation: 67788
A non-base
alternative, using the lcprefix
function in Biostrings
to find the "Longest Common Prefix [...] of two strings"
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
x2 <- sort(x)
substr(x2[1], start = 1, stop = lcprefix(x2[1], x2[length(x2)]))
# [1] "ADA44"
Upvotes: 5
Reputation: 24074
Taking inspiration from what you suggested, you can try this function :
comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# compute the cumulative sum of common elements
cs_x<-cumsum(d_x[[1]]==d_x[[2]])
# check if there is at least one common element
if(cs_x[1]!=0) {
# see when it stops incrementing and get the position of last common element
der_com<-which(diff(cs_x)==0)[1]
# return the common part
return(substr(x[1],1,der_com))
} else { # else, return an empty vector
return(character(0))
}
}
UPDATE
Following @nicola suggestion, a simpler and more elegant variant for the function:
comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# search for the first not common element and so, get the last matching one
der_com<-match(FALSE,do.call("==",d_x))-1
# if there is no matching element, return an empty vector, else return the common part
ifelse(der_com==0,return(character(0)),return(substr(x[1],1,der_com)))
}
Examples:
With your data
x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"
When there is no common starting substring
x<-c("abc","def")
> comsub(x)
# character(0)
Upvotes: 13