Reputation: 45
I need to compute Hamming distances and plot the results as clusters in R for a dataset that has 2 columns and 45,000+ rows. Is there a well-known library available for this? Or are some strategies recommended over others?
I tried the hamming.distance function from the "e1071" package and got the error below. But even if I figure out how to calculate the Hamming distance, I am not sure how to get from those results to a cluster graph.
Error: evaluation nested too deeply: infinite recursion/options(expressions=)?
2015-02-02 18:50:59.704 R[1162:679616] Communications error: <OS_xpc_error<error: 0x7fff7aaadb60> { count = 1, contents =
"XPCErrorDescription" => <string: 0x7fff7aaadfa8> { length = 22, contents = "Connection interrupted" }
I tried this code:
H<-hamming.distance(df)
Where df looks like this:
Name Code
name1 0
name2 0
name3 1
name4 1
name5 0
Thank you for looking at this question and any help is greatly appreciated.
Upvotes: 1
Views: 2561
Reputation: 690
To compare each row's value to the previous row's value, create a new column that holds the previous row's value and apply the function below across both columns.
df = data.frame(x1=c("0", "0", "1"), stringsAsFactors=FALSE)
# Shift x1 down by one row so that x2 holds the previous row's value
df$x2 = c(NA, df$x1[-nrow(df)])
hamming.distance = function(string1, string2) {
  # Skip rows that have no previous value to compare against
  if (is.na(string2)) {
    return(NULL)
  }
  string1 = as.character(string1)
  string2 = as.character(string2)
  if (nchar(string1) != nchar(string2)) warning("Inputs must be of equal length")
  # Split each string into a vector of single characters
  chars1 = strsplit(string1, "")[[1]]
  chars2 = strsplit(string2, "")[[1]]
  # Count the positions at which the characters differ
  return(sum(chars1 != chars2))
}
results = mapply(hamming.distance, df[,1], df[,2])
unlist(results)
Note: the length of unlist(results) will be 1 shorter than the number of rows in your df object, because the first row has no previous value (its x2 is NA), so the function returns NULL for it and unlist drops that entry.
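For 45,000+ rows, calling a function once per row can be slow. The same consecutive-row comparison can be sketched in vectorized base R, assuming every code has the same number of characters. The `codes` vector below is made-up example data, not the asker's dataset.

```r
# Hypothetical example data; every code must have the same length
codes <- c("0110", "0111", "1111", "0000")

# Split every string into single characters: one matrix row per string
chars <- do.call(rbind, strsplit(codes, ""))

# Compare each row with the previous row and count differing positions
n <- nrow(chars)
consecutive <- rowSums(chars[-1, , drop = FALSE] != chars[-n, , drop = FALSE])
consecutive
```

As with the mapply version, the result has one entry fewer than there are rows, since the first row has nothing before it.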
Upvotes: 2
Reputation: 2777
You can use the stringdist package to calculate the Hamming distance: http://cran.r-project.org/web/packages/stringdist/stringdist.pdf
For example:
library(stringdist)
df <- data.frame( column1 = c("toned", "10112"), column2 = c("roses", "10223"))
stringdistmatrix(df$column1, df$column2, method = "hamming")  # distance matrix
stringdist(df$column1, df$column2, method = "hamming")  # vector of distances
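To get from a distance matrix to a cluster graph, one possible route is hierarchical clustering with base R's hclust, which takes a dist object and can be plotted as a dendrogram. The sketch below is only an illustration under assumptions: the `codes` vector is made-up data, the pairwise matrix is built in base R so the example has no package dependencies, and average linkage and `k = 2` are arbitrary choices. Clustering only the unique values keeps the matrix small; a full matrix over 45,000 rows would have on the order of two billion entries.

```r
# Hypothetical example data; every code must have the same length
codes <- c("0110", "0111", "1111", "0000", "0001", "0110")

# Work on unique values only so the distance matrix stays small
u <- unique(codes)
chars <- do.call(rbind, strsplit(u, ""))
n <- nrow(chars)

# Pairwise Hamming distances: row i compared against every row
d <- matrix(0, n, n, dimnames = list(u, u))
for (i in seq_len(n)) {
  d[i, ] <- rowSums(chars != chars[rep(i, n), , drop = FALSE])
}

# Hierarchical clustering on the distance matrix, drawn as a dendrogram
hc <- hclust(as.dist(d), method = "average")
plot(hc, main = "Hamming distance clusters")

# Cut the tree into k groups (k = 2 is an arbitrary choice here)
clusters <- cutree(hc, k = 2)
clusters
```

The cluster label of each original row can then be looked up with `clusters[codes]`.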
Upvotes: 1