Reputation: 1139
I have a file that represents the gene structure of bacteria models. Each row represents a model. A row is a fixed length binary string of which genes are present (1 for present and 0 for absent). My task is to compare the gene sequence for each pair of models and get a score of how similar they are and computer a dissimilarity matrix.
In total there are 450 models (rows) in one file and there are 250 files. I have a working code however it takes roughly 1.6 hours to do the whole thing for only one file.
#Sample Data
Generation: 0
[0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
[1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
[0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]
[0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
[1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
What my code does:
generationFiles = list.files(pattern = "^Generation.*\\_\\d+.txt$")
start.time = Sys.time()
for(a in 1:length(generationFiles)){
fname = generationFiles[a]
geneData = read.table(generationFiles[a], sep = "\n", header = T, stringsAsFactors = F)
geneCount = str_count(geneData[1,1],"[1|0]")
geneDF <- data.frame(Gene = paste0("Gene_", c(1:geneCount)), stringsAsFactors = F)
#convert the string into a data frame
for(i in 1:nrow(geneData)){
#remove the square brackets
dataRow = substring(geneData[i,1], 2, nchar(geneData[i,1]) - 1)
#removing white spaces
dataRow = gsub(" ", "", dataRow, fixed = T)
#splitting the string
dataRow = strsplit(dataRow, ",")
#converting to numeric
dataRow = as.numeric(unlist(dataRow))
colName = paste("M_",i,sep = "")
geneDF <- cbind(geneDF, dataRow)
colnames(geneDF)[colnames(geneDF) == 'dataRow'] <- colName
dataRow <- NULL
}
summaryDF <- data.frame(Model1 = character(), Model2 = character(), Common = integer(),
Uncommon = integer(), Absent = integer(), stringsAsFactors = F)
modelNames = paste0("M_",c(1:450))
secondaryLevel = modelNames
fileName = paste0("D://BellosData//GC_3//Summary//",substr(fname, 1, nchar(fname) - 4),"_Summary.txt")
for(x in 1:449){
secondaryLevel = secondaryLevel[-1]
for(y in 1:length(secondaryLevel)){
result = geneDF[modelNames[x]] + geneDF[secondaryLevel[y]]
summaryDF <- rbind(summaryDF, data.frame(Model1 = modelNames[x],
Model2 = secondaryLevel[y],
Common = sum(result == 2),
Uncommon = sum(result == 1),
Absent = sum(result == 0)))
}
}
write.table(summaryDF, fileName, sep = ",", quote = F, row.names = F)
geneDF <- NULL
summaryDF <- NULL
geneData <-NULL
}
converting to matrix
maxNum = max(summaryDF$Common)
normalizeData = summaryDF[,c(1:3)]
normalizeData[c('Common')] <- lapply(normalizeData[c('Common')], function(x) 1 - x/maxNum)
normalizeData[1:2] <- lapply(normalizeData[1:2], factor, levels=unique(unlist(normalizeData[1:2])))
distMatrixN = xtabs(Common~Model1+Model2, data=normalizeData)
distMatrixN = distMatrixN + t(distMatrixN)
Is there a way to make the process run faster? Is there a more efficient way to do the comparison?
Upvotes: 2
Views: 91
Reputation: 3188
This code should be faster. Nested loops are nightmare slow in R. Operations like rbind-ing
one row at a time is also among the worst and slowest ideas in R programming.
Generate 450 rows with 20 elements of 0, 1 on each row.
M = do.call(rbind, replicate(450, sample(0:1, 20, replace = T), simplify = F))
Generate list of combination(450, 2) numbers of row pairs
L = split(v<-t(utils::combn(450, 2)), seq(nrow(v))); rm(v)
Apply whatever comparison function you want. In this case, the number of 1's at the same position for each row combinations. If you want to calculate different metrics, just write another function(x) where M[x[1],]
is the first row and M[x[2],]
is the second row.
O = lapply(L, function(x) sum(M[x[1],]&M[x[2],]))
Code takes ~4 seconds a fairly slow 2.6 Ghz Sandy Bridge
Get a clean data.frame with your results, three columns : row 1, row 2, metric between the two rows
data.frame(row1 = sapply(L, `[`, 1),
row2 = sapply(L, `[`, 2),
similarity_metric = do.call(rbind, O))
To be honest, I didn't thoroughly comb through your code to replicate exactly what you were doing. If this is not what you are looking for (or can't be modified to achieve what you are looking for), leave a comment.
Upvotes: 3