Reputation: 1274
Good afternoon!
In R, I developed a custom function that computes the distance between two rows of mixed type (character and numeric columns).
The data used is:
data=structure(list(X126 = c("X266", "B7", "T133", "J34", "T218",
"X249"), TVGUIDE = c("TVGUIDE", "MODMAT", "MASSEY", "KMART",
"MASSEY", "ROSES"), YES = c("YES", "YES", "YES", "NO", "YES",
"NO"), KEY = c("KEY", "KEY", "KEY", "KEY", "KEY", "KEY"), YES.1 = c("YES",
"YES", "YES", "YES", "YES", "YES"), BENTON = c("BENTON", "BENTON",
"BENTON", "BENTON", "BENTON", "BENTON"), GALLATIN = c("GALLATIN",
"GALLATIN", "GALLATIN", "GALLATIN", "GALLATIN", "GALLATIN"),
UNCOATED = c("UNCOATED", "UNCOATED", "UNCOATED", "UNCOATED",
"UNCOATED", "COATED"), UNCOATED.1 = c("UNCOATED", "COATED",
"UNCOATED", "COATED", "UNCOATED", "COATED"), NO = c("NO",
"NO", "NO", "NO", "NO", "NO"), LINE = c("LINE", "LINE", "LINE",
"LINE", "LINE", "LINE"), YES.2 = c("YES", "YES", "YES", "YES",
"YES", "YES"), Motter94 = c("Motter94", "WoodHoe70", "WoodHoe70",
"WoodHoe70", "WoodHoe70", "Motter94"), TABLOID = c("TABLOID",
"CATALOG", "CATALOG", "TABLOID", "CATALOG", "TABLOID"), NorthUS = c("NorthUS",
"NorthUS", "NorthUS", NA, "NorthUS", "CANADIAN"), band = c("noband",
"noband", "noband", "noband", "noband", "noband"), X25503 = c(25503L,
47201L, 39039L, 37351L, 38039L, 35751L), X821 = c(821L, 815L,
816L, 816L, 816L, 827L), X2 = c(2L, 9L, 9L, 2L, 2L, 2L),
X1911 = c(NA, NA, 1910L, 1910L, 1910L, 1911L), X46 = c(46L,
40L, 40L, 46L, 40L, 46L), X78 = c(80L, 80L, 75L, 80L, 76L,
75L), X20 = c(20L, 30L, 30L, 30L, 28L, 30L), X1700 = c(1900L,
1850L, 1467L, 2100L, 1467L, 2600L), X40 = c(40L, 40L, 40L,
40L, 40L, 40L), X100 = c(100L, 100L, 100L, 100L, 100L, 100L
), X55 = c(55, 62, 52, 50, 50, 50), X0.2 = c(0.3, 0.433,
0.3, 0.3, 0.267, 0.3), X17 = c(15, 16, 16, 17, 16.8, 16.5
), X0.75 = c(0.75, NA, 0.3125, 0.75, 0.4375, 0.75), X13.1 = c(6.6,
6.5, 5.6, 0, 8.6, 0), X50.5 = c(54.9, 53.8, 55.6, 57.5, 53.8,
62.5), X36.4 = c(38.5, 39.8, 38.8, 42.5, 37.6, 37.5), X0 = c(0,
0, 0, 5, 5, 6), X0.1 = c(0, 0, 0, 0, 0, 0), X2.5 = c(2.5,
2.8, 2.5, 2.3, 2.5, 2.5), X1 = c(0.7, 0.9, 1.3, 0.6, 0.8,
0.6), X34 = c(34, 40, 40, 35, 40, 30), X105 = c(105, 103.87,
108.06, 106.67, 103.87, 106.67)), row.names = c(NA, 6L), class = "data.frame")
data
The function is defined as follows (x and y are row indices):
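mixed_similarity_distance <- function(data, x, y) {
  # Number of character columns (assumes they come before the numeric ones)
  length_charachter_part <- length(which(sapply(data, class) == "character"))
  # Column-wise equality of rows x and y; NA entries never count as a match
  comparison <- c(data[x, 1:length_charachter_part] == data[y, 1:length_charachter_part])
  char_distance <- length_charachter_part - sum(comparison, na.rm = TRUE)
  # Euclidean distance on the numeric columns
  numerical_distance <- dist(rbind(data[x, -c(1:length_charachter_part)], data[y, -c(1:length_charachter_part)]))
  total_distance <- numerical_distance + char_distance
  return(total_distance)
}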
Examples of computing distances:
mixed_similarity_distance(data=data,1,1) # output 0
mixed_similarity_distance(data=data,2,2) # output 0
mixed_similarity_distance(data=data,3,1) # distance between the first and the third rows.
I want to compute the distance matrix over all possible pairs of rows.
I tried:
distance_matrix <- Vectorize(mixed_similarity_distance, c("x", "y"))
distance_matrix(1:nrow(data), 1:nrow(data), data)
I hope my question is clear!
Thank you for your help!
Upvotes: 0
Views: 155
Reputation: 1101
You can try the following, using the apply function and expand.grid:
# Compute all pairwise distances
res <- apply(expand.grid(1:6, 1:6), 1, function(x) {
  mixed_similarity_distance(data = data, x[1], x[2])
})
# Convert res into a matrix
matrix(res, nrow = 6, ncol = 6, byrow = TRUE)
         [,1]      [,2]      [,3]       [,4]       [,5]      [,6]
[1,]     0.00 22712.811 13851.308 12122.0171 12829.3965 10508.754
[2,] 22712.81     0.000  8554.237 10316.7203  9599.7534 12015.549
[3,] 13851.31  8554.237     0.000  1808.8448  1001.0576  3485.796
[4,] 12122.02 10316.720  1808.845     1.0000   941.0047  1681.372
[5,] 12829.40  9599.753  1001.058   941.0047     0.0000  2561.244
[6,] 10508.75 12015.549  3485.796  1681.3721  2561.2440     0.000
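If you would rather keep the Vectorize idea from the question, two things probably need to change. The arguments are passed positionally, so 1:nrow(data) ends up in data rather than in x. Vectorize on its own also only pairs the indices element by element (1 with 1, 2 with 2, ...), so outer is needed to get every combination. Below is a minimal sketch of that approach. The toy data frame and the names mixed_distance and pair_distance are made up for illustration, and mixed_distance is only a compact stand-in for the function in the question, not a replacement for it.

```r
# Toy data (hypothetical): character columns first, numeric columns after,
# mirroring the layout of the data in the question.
toy <- data.frame(
  press = c("A", "B", "A"),
  paper = c("COATED", "COATED", "UNCOATED"),
  speed = c(0, 3, 0),
  width = c(0, 4, 1),
  stringsAsFactors = FALSE
)

# Stand-in for mixed_similarity_distance (illustrative name):
# number of mismatching character columns + Euclidean distance on the rest.
mixed_distance <- function(data, x, y) {
  is_char <- sapply(data, is.character)
  # NA comparisons are dropped by na.rm, so they count as mismatches
  matches <- sum(data[x, is_char] == data[y, is_char], na.rm = TRUE)
  numeric_part <- as.numeric(dist(rbind(data[x, !is_char], data[y, !is_char])))
  (sum(is_char) - matches) + numeric_part
}

# Vectorize over the two row indices only; the data frame is fixed inside.
pair_distance <- Vectorize(function(x, y) mixed_distance(toy, x, y))

# outer() expands the indices to all pairs and reshapes the result.
idx <- seq_len(nrow(toy))
distance_matrix <- outer(idx, idx, pair_distance)
distance_matrix
```

With the real data, replacing toy by data and mixed_distance by mixed_similarity_distance should give the same matrix as the apply version. The 1.0000 at position [4,4] above most likely comes from the NA in the NorthUS column of row 4: NA == NA is NA, so that column is counted as a mismatch even when a row is compared with itself.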
Upvotes: 1