Fyn Oster

Reputation: 13

Iterate through ID-matched Euclidean distances using dist() in R

I have a dataset that consists of various individuals' ratings of a bunch of variables. Each individual, differentiated by unique ID numbers, rated each of the variables for two targets: for themselves (target = s) and someone else (target = o). A simplified mock-up of the dataframe looks like this:

id <- c("123", "123", "234", "234", "345", "345", "456", "456", "567", "567")
target <- c("s", "o", "s", "o", "s", "o", "s", "o", "s", "o")
v1 <- c(1, 2, 3, 7, 2, 5, 4, 4, 1, 3)
v2 <- c(7, 6, 5, 7, 1, 3, 5, 4, 1, 1)
v3 <- c(2, 2, 2, 4, 5, 2, 7, 1, 3, 3)
df <- data.frame(id, target, v1, v2, v3)

I want to find the Euclidean distance between each individual's self-rating and their other-person rating across all the variables. Ideally, the end result would look something like this:

id <- c("123", "234", "345", "456", "567")
euclidean_distance <- c(1.414214, 4.898979, 4.690416, 6.082763, 2)
df_final <- data.frame(id, euclidean_distance)

An example of how I'm doing this for one individual would be:

library(dplyr)

# id is stored as a character vector, so compare against a string
id_123 <- df %>%
  filter(id == "123")
dist(select(id_123, v1:v3))

However, doing this by hand for one individual at a time is slow (my actual data set has hundreds of individuals, not just 5) and makes transcription mistakes more likely. So I'm trying to figure out a way to iterate through all the individuals (that is, every unique ID number) and get one Euclidean distance value per individual.

Do you have any suggestions about how to achieve this? Any help greatly appreciated!

Upvotes: 1

Views: 76

Answers (3)

VinceGreg

Reputation: 834

Edit: In hindsight, I prefer @thelatemail's answer, which uses a grouped summarise.

Here is a solution with purrr::map(). It is not exactly a loop (see the Functionals chapter of Advanced R). The ~ .x formula syntax is outdated; comments are welcome so I can improve it!

library(tidyverse)
df %>%
  split(.$id) %>%
  map(~ .x %>%
        select(v1:v3) %>%
        dist() %>%
        as.numeric() %>%
        as_tibble_col(column_name = "euclidean_distance")) %>%
  list_rbind(names_to = "id")
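
Since the ~ .x syntax was flagged above as outdated, here is a rough sketch of the same idea written with an anonymous function instead. It assumes R >= 4.1 (for the \(d) shorthand and the native pipe) and that every id has exactly two rows, so that dist() returns a single value per group. The name res and the map_dbl() plus enframe() combination are my own choices, not part of the original answer.

```r
library(dplyr)
library(purrr)
library(tibble)

# Mock data from the question
df <- data.frame(
  id     = rep(c("123", "234", "345", "456", "567"), each = 2),
  target = rep(c("s", "o"), times = 5),
  v1 = c(1, 2, 3, 7, 2, 5, 4, 4, 1, 3),
  v2 = c(7, 6, 5, 7, 1, 3, 5, 4, 1, 1),
  v3 = c(2, 2, 2, 4, 5, 2, 7, 1, 3, 3)
)

res <- df |>
  split(df$id) |>                                      # one data frame per id
  map_dbl(\(d) as.numeric(dist(select(d, v1:v3)))) |>  # one distance per id
  enframe(name = "id", value = "euclidean_distance")   # named vector -> tibble

res
```

map_dbl() errors if any group yields something other than a single number, which doubles as a check that no id has missing or extra rows.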

Nice minimal reproducible example by the way :)

Upvotes: 0

Dave2e

Reputation: 24139

You can vectorize this solution and avoid all loops. First pivot your data wider so each id is on one line.

library(tidyr)
#pivot wider so each ID is on one line
df2 <- pivot_wider(df, id_cols = "id", values_from = starts_with("v"),
                   names_from = "target", names_glue = "{target}_{.value}")

head(df2)

#find the square of the differences between the corresponding s and o columns
#columns 2, 4, 6 are s_v1, s_v2, s_v3; columns 3, 5, 7 are o_v1, o_v2, o_v3
squares <- (df2[,c(2, 4, 6)] - df2[,c(3, 5, 7)])^2

#find the square root of sum of the squares (the distance calculation)
#sqrt(rowSums(squares))

answer<-data.frame(id=df2$id, dist = sqrt(rowSums(squares)))
answer


   id     dist
1 123 1.414214
2 234 4.898979
3 345 4.690416
4 456 6.082763
5 567 2.000000
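
The positional indexing above depends on the column order that pivot_wider() produces. As a hedged variation, the s and o columns can be picked by name instead, which should keep working if the column order changes. The helper names vars, s_cols and o_cols are introduced here for illustration only.

```r
library(tidyr)

# Mock data from the question
df <- data.frame(
  id     = rep(c("123", "234", "345", "456", "567"), each = 2),
  target = rep(c("s", "o"), times = 5),
  v1 = c(1, 2, 3, 7, 2, 5, 4, 4, 1, 3),
  v2 = c(7, 6, 5, 7, 1, 3, 5, 4, 1, 1),
  v3 = c(2, 2, 2, 4, 5, 2, 7, 1, 3, 3)
)

df2 <- pivot_wider(df, id_cols = "id", values_from = starts_with("v"),
                   names_from = "target", names_glue = "{target}_{.value}")

# Build the column names from the variable names rather than from positions
vars   <- c("v1", "v2", "v3")
s_cols <- paste0("s_", vars)
o_cols <- paste0("o_", vars)

# Squared differences between matching s and o columns, then row-wise distance
squares <- (as.matrix(df2[, s_cols]) - as.matrix(df2[, o_cols]))^2
answer  <- data.frame(id = df2$id, dist = sqrt(rowSums(squares)))
answer
```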

Upvotes: 1

thelatemail

Reputation: 93938

A standard dplyr grouping and summarise should take care of this. Calling dist for each group adds some overhead, but for a couple of hundred groups this should not be a big drama:

library(dplyr)  # the .by argument needs dplyr >= 1.1.0
df %>% summarise(dist = dist(cbind(v1,v2,v3)), .by=id)
#   id     dist
#1 123 1.414214
#2 234 4.898979
#3 345 4.690416
#4 456 6.082763
#5 567 2.000000
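
One caveat worth sketching: dist() returns every pairwise distance among the rows it is given, so the one-row-per-id result relies on each id having exactly one s row and one o row. The check below is an optional guard under that assumption; bad_ids and res are hypothetical names, and wrapping the result in as.numeric() is my addition to get a plain numeric column.

```r
library(dplyr)  # .by needs dplyr >= 1.1.0

# Mock data from the question
df <- data.frame(
  id     = rep(c("123", "234", "345", "456", "567"), each = 2),
  target = rep(c("s", "o"), times = 5),
  v1 = c(1, 2, 3, 7, 2, 5, 4, 4, 1, 3),
  v2 = c(7, 6, 5, 7, 1, 3, 5, 4, 1, 1),
  v3 = c(2, 2, 2, 4, 5, 2, 7, 1, 3, 3)
)

# Find ids that do not have exactly one "s" row and one "o" row
bad_ids <- df %>%
  summarise(n_s = sum(target == "s"), n_o = sum(target == "o"), .by = id) %>%
  filter(n_s != 1 | n_o != 1)

# Stop early if the data are incomplete or duplicated
stopifnot(nrow(bad_ids) == 0)

res <- df %>%
  summarise(dist = as.numeric(dist(cbind(v1, v2, v3))), .by = id)
res
```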

Upvotes: 1
