Reputation: 939
I have a piece of code that can be written by using the "iris" dataset (my dataset is much larger though).
The idea is to calculate all the possible distances between all the elements of a species against all the elements of the other species.
The distance, for element1 and element2, is calculated using the geometric distance as:
distance=((Sepal_Width_element1-Sepal_Width_element2)^2 + (Petal_Length_element1-Petal_Length_element2)^2)^0.5
If such distance is lower than the threshold we store such value for further analysis.
I was able to write the task using for loops, but I believe that one of the "apply" functions can speed up the code and make it more readable.
The code is the following:
threshold=20
lsp_names=unique(iris$Species) #define a vector of names for each unique species
for(n1 in 1:(length(lsp_names)-1) ){
for(n2 in (n1+1):length(lsp_names)){
n_spec1=lsp_names[n1] ## name of the species 1
n_spec2=lsp_names[n2] ## name of the species 2
## generate a data frame for the species with name n_spec1
gph1=iris[iris$Species==n_spec1,]
## generate a data frame for the species with name n_spec2
gph2=iris[iris$Species==n_spec2,]
dist_values=NULL
## loop trough all possible couples between the species n_spec1 and n_spec2,
## evaluate the distance and store it if it is lower than the threshold
for(i in 1:nrow(gph1)){
for(j in 1:nrow(gph2)){
d=((gph1$Sepal.Width[i]-gph2$Sepal.Width[j])^2 + (gph1$Petal.Length[i]-gph2$Petal.Length[j])^2)^0.5
if(d<=threshold){
dist_values=c(dist_values,d)
}
}
}
## print a summary
a_std=sd(dist_values)
a_mean=mean(dist_values)
message(sprintf("distance between:%s %s mean:%f sigma:%f\n", n_spec1, n_spec2, a_mean, a_std))
}
}
My output:
distance between:setosa versicolor mean:2.923921 sigma:0.442498
distance between:setosa virginica mean:4.146551 sigma:0.557892
distance between:versicolor virginica mean:1.389100 sigma:0.703420
Upvotes: 1
Views: 472
Reputation: 3973
I've rewritten your code to get rid of the for loops in both the species loop and the individuals loop.
First of all, looping through individuals to calculate the distances is not necessary. because the dist()
function, which is vectorized, can find the Euclidean distance between all pairs of rows in a matrix, as you are doing.
Second, we can use the combn()
function to make a list of all species pairs and then find the average distance between all individuals in each of the pairs.
Here is the code (I didn't include your summary printing message).
# Set threshold
threshold <- 20
# Calculate Euclidean distances between all pairs of individuals
iris_dist <- dist(iris[, c('Sepal.Width', 'Petal.Length')], method = 'euclidean')
# Convert the dist object to a matrix so we can index it by 2 dimensions
iris_matrix <- as.matrix(iris_dist)
# Find all possible pairs of species in iris dataset, as a list
iris_species <- unique(iris$Species)
iris_pairs <- combn(iris_species, 2, simplify = FALSE)
# Calculate mean and standard deviation of distances of individuals of each pair of species
lapply(iris_pairs, function(species) {
# Subset distance matrix for the pair
dist_values <- iris_matrix[iris$Species %in% species[1], iris$Species %in% species[2]]
# Retain only values below threshold
dist_values <- dist_values[dist_values <= threshold]
# Calculate mean and SD, and return
return(c(mean = mean(dist_values), sd = sd(dist_values)))
})
List of species pairs (iris_pairs
):
[[1]]
[1] setosa versicolor
[[2]]
[1] setosa virginica
[[3]]
[1] versicolor virginica
List of mean and standard deviations of cross-species distances:
[[1]]
mean sd
2.9239210 0.4424981
[[2]]
mean sd
4.1465515 0.5578919
[[3]]
mean sd
1.3891000 0.7034196
Upvotes: 1
Reputation: 1502
How about this? You can join the dataset to itself then filter out the rows where the species are the same so as to get a wide dataframe where the species are different:
ii <- merge(iris, iris, by = NULL, all = TRUE)
ii <- ii[with(ii, as.numeric(ii$Species.x) < as.numeric(ii$Species.y)), ]
Then you can calculate the distances between the elements:
ii$dist <- sqrt((ii$Petal.Width.x - ii$Petal.Width.y)^2 + (ii$Sepal.Length.x - ii$Sepal.Length.y)^2)
And finally you can calculate the mean distance by species pair:
aggregate(ii[, 'dist'], by = list(ii$Species.x, ii$Species.y), mean)
Group.1 Group.2 x
1 setosa versicolor 1.497375
2 setosa virginica 2.441457
3 versicolor virginica 1.166729
I don't get the values you get, though, and I don't know why.
Upvotes: 2