Fabrizio
Fabrizio

Reputation: 939

Converting loop to apply, iris example

I have a piece of code that can be written by using the "iris" dataset (my dataset is much larger though).

The idea is to calculate all the possible distances between all the elements of a species against all the elements of the other species.

The distance, for element1 and element2, is calculated using the geometric distance as:

distance=((Sepal_Width_element1-Sepal_Width_element2)^2 + (Petal_Length_element1-Petal_Length_element2)^2)^0.5

If such distance is lower than the threshold we store such value for further analysis.

I was able to write the task using for loops, but I believe that one of the "apply" functions can speed up the code and make it more readable.

The code is the following:

threshold=20
lsp_names=unique(iris$Species) #define a vector of names for each unique species
for(n1 in 1:(length(lsp_names)-1) ){
  for(n2 in (n1+1):length(lsp_names)){

    n_spec1=lsp_names[n1] ## name of the species 1
    n_spec2=lsp_names[n2] ## name of the species 2
    ## generate a data frame for the species with name n_spec1
    gph1=iris[iris$Species==n_spec1,]
    ## generate a data frame for the species with name n_spec2
    gph2=iris[iris$Species==n_spec2,]

    dist_values=NULL
    ## loop trough all possible couples between the species n_spec1 and n_spec2,
    ## evaluate the distance and store it if it is lower than the threshold
    for(i in 1:nrow(gph1)){
      for(j in 1:nrow(gph2)){
        d=((gph1$Sepal.Width[i]-gph2$Sepal.Width[j])^2 + (gph1$Petal.Length[i]-gph2$Petal.Length[j])^2)^0.5
        if(d<=threshold){
          dist_values=c(dist_values,d)
        }
      }
    }

    ## print a summary
    a_std=sd(dist_values)
    a_mean=mean(dist_values)
    message(sprintf("distance between:%s %s mean:%f  sigma:%f\n", n_spec1, n_spec2, a_mean, a_std))
  }
}

My output:

distance between:setosa versicolor mean:2.923921  sigma:0.442498
distance between:setosa virginica mean:4.146551  sigma:0.557892
distance between:versicolor virginica mean:1.389100  sigma:0.703420

Upvotes: 1

Views: 472

Answers (2)

qdread
qdread

Reputation: 3973

I've rewritten your code to get rid of the for loops in both the species loop and the individuals loop.

First of all, looping through individuals to calculate the distances is not necessary. because the dist() function, which is vectorized, can find the Euclidean distance between all pairs of rows in a matrix, as you are doing.

Second, we can use the combn() function to make a list of all species pairs and then find the average distance between all individuals in each of the pairs.

Code

Here is the code (I didn't include your summary printing message).

# Set threshold
threshold <- 20

# Calculate Euclidean distances between all pairs of individuals
iris_dist <- dist(iris[, c('Sepal.Width', 'Petal.Length')], method = 'euclidean')

# Convert the dist object to a matrix so we can index it by 2 dimensions
iris_matrix <- as.matrix(iris_dist)

# Find all possible pairs of species in iris dataset, as a list
iris_species <- unique(iris$Species)
iris_pairs <- combn(iris_species, 2, simplify = FALSE)

# Calculate mean and standard deviation of distances of individuals of each pair of species
lapply(iris_pairs, function(species) {
  # Subset distance matrix for the pair
  dist_values <- iris_matrix[iris$Species %in% species[1], iris$Species %in% species[2]]
  # Retain only values below threshold
  dist_values <- dist_values[dist_values <= threshold]
  # Calculate mean and SD, and return
  return(c(mean = mean(dist_values), sd = sd(dist_values)))
})

Output

List of species pairs (iris_pairs):

[[1]]
[1] setosa     versicolor

[[2]]
[1] setosa    virginica

[[3]]
[1] versicolor virginica 

List of mean and standard deviations of cross-species distances:

[[1]]
     mean        sd 
2.9239210 0.4424981 

[[2]]
     mean        sd 
4.1465515 0.5578919 

[[3]]
     mean        sd 
1.3891000 0.7034196 

Upvotes: 1

user2474226
user2474226

Reputation: 1502

How about this? You can join the dataset to itself then filter out the rows where the species are the same so as to get a wide dataframe where the species are different:

ii <- merge(iris, iris, by = NULL, all = TRUE)
ii <- ii[with(ii, as.numeric(ii$Species.x) < as.numeric(ii$Species.y)), ]

Then you can calculate the distances between the elements:

ii$dist <- sqrt((ii$Petal.Width.x - ii$Petal.Width.y)^2 + (ii$Sepal.Length.x - ii$Sepal.Length.y)^2)

And finally you can calculate the mean distance by species pair:

aggregate(ii[, 'dist'], by = list(ii$Species.x, ii$Species.y), mean)
     Group.1    Group.2        x
1     setosa versicolor 1.497375
2     setosa  virginica 2.441457
3 versicolor  virginica 1.166729

I don't get the values you get, though, and I don't know why.

Upvotes: 2

Related Questions