Catherine

Reputation: 1

How to obtain minimum difference between 2 columns

I want to find, for each patient, the closest other patient and the distance to them. The complication is that the same patient may appear in both the Patient1 and Patient2 columns. See the example below:

Patient1    Patient2    Distance
A           B           8
A           C           11
A           D           19
A           E           23
B           F           6
C           G           25

So the output I need is:

Patient Patient_closest_distance Distance
A       B                        8
B       F                        6
C       A                        11

I have tried using list() inside a data.table call in RStudio:

library(data.table)
DT <- data.table(Full_data)
j1 <- DT[ , list(Distance = min(Distance)), by = Patient1]
j2 <- DT[ , list(Distance = min(Distance)), by = Patient2]

However, this only gives the minimum distance within each column separately. For example, C gets 2 results because it appears in both columns, rather than a single closest patient across both columns. Also, I only get the distances, so I can't see which patient is linked to which:

   Patient1 SNP
1:        A   8


Upvotes: 0

Views: 132

Answers (1)

Monk

Reputation: 407

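The code below works. It adds the mirrored pairs, takes the minimum distance for each pair, sorts by distance, and keeps the first (nearest) row for each patient.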

# Create sample data frame
df <- data.frame(
  Patient1 = c('A','B', 'A', 'A', 'C', 'B'),
  Patient2 = c('B', 'A','C', 'D', 'D', 'F'),
  Distance = c(10, 1, 20, 3, 60, 20)
)
# Format as character variables (instead of factors)
df$Patient1 <- as.character(df$Patient1)
df$Patient2 <- as.character(df$Patient2)

# If you want mirror paths included, you'll need to add them.
# Ex.) A to C at a distance of 20 is equivalent to C to A at a distance of 20
# If you don't need these mirror paths, you can ignore these two lines.
df_mirror <- data.frame(Patient1 = df$Patient2, Patient2 = df$Patient1, Distance = df$Distance)
df <- rbind(df, df_mirror); rm(df_mirror)

# For each pair, keep the minimum distance
library(dplyr)
df <- summarise(group_by(df, Patient1, Patient2), Distance = min(Distance))

# Sort so the smallest distance comes first
nearest <- df[order(df$Distance), ]
# Keep only the first (nearest) row for each patient
nearest <- nearest[!duplicated(nearest$Patient1), ]
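
Since the question already uses data.table, the same mirror-then-pick-the-minimum idea can be sketched there too, using the data from the question. This is only a sketch under some assumptions: the intermediate names `both`, `nearest`, and `nearest_p1` are made up for illustration, ties are broken by whichever row comes first (that is how `which.min` behaves), and the last step assumes you only want patients that appear in Patient1, as in the expected output.

```r
library(data.table)

# Data from the question
DT <- data.table(
  Patient1 = c("A", "A", "A", "A", "B", "C"),
  Patient2 = c("B", "C", "D", "E", "F", "G"),
  Distance = c(8, 11, 19, 23, 6, 25)
)

# Stack the original pairs on top of the mirrored pairs, so every patient
# appears in the first column with each of its partners
both <- rbind(
  DT[, .(Patient = Patient1, Patient_closest_distance = Patient2, Distance = Distance)],
  DT[, .(Patient = Patient2, Patient_closest_distance = Patient1, Distance = Distance)]
)

# For each patient, keep the whole row with the smallest distance,
# so the partner is kept alongside the distance
nearest <- both[, .SD[which.min(Distance)], by = Patient]

# Assumption: restrict to patients listed in Patient1, as in the question
nearest_p1 <- nearest[Patient %in% DT$Patient1]
print(nearest_p1)
```

Keeping the full row via `.SD[which.min(Distance)]` is what preserves the link between a patient and its closest partner. Drop the last filter if you also want a row for D, E, F, and G.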

Upvotes: 1
