FranckLM
FranckLM

Reputation: 23

Find the closest value in the group for each value in the group R

It's been two days since I'm trying to find this :

I have a dataframe with more than 2 mil observations with this structure

id = c(1,2,3,4,5,6,7,8,9,10,11,12)
group = c(1,1,1,1,2,2,2,2,3,3,3,3)
sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F')
time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)

I would like to find for each female the male with the closest time and this by group

By example here id = 2 is a female in the group 1 with time = 11 the closest male in the group 1 is id = 3

ect for each female in each group

I tried to use something like this

keep <- function(x){
   a <-  df[which.min(abs(df[which(df[,'sex'] == "M"),'time']-x[,'time'])),]
   return(a) 
}

apply(df, 1, keep)

But it does not work.

If someone can help me it would be great.

Upvotes: 2

Views: 970

Answers (4)

Wimpel
Wimpel

Reputation: 27732

data.table solution using a rolling join to nearest time.
Using the df from Thomas' answer

setDT(df)
df[sex=="F",][,closestM_id := df[sex=="M",][df[sex=="F",], 
                                            x.id, 
                                            on = .(group, time), roll = "nearest"]]
#    id group sex time closestM_id
# 1:  2     1   F 11.0           3
# 2:  6     2   F 15.0           5
# 3:  7     2   F  9.0           5
# 4:  8     2   F  7.4           5
# 5: 12     3   F 21.0           9

Upvotes: 1

Damian
Damian

Reputation: 1433

Restructuring the data would help. Create a separate data frame for each sex, create a third data set with all unique pairings of males and females, then merge and subset to narrow it down to the desired pairs. expand.grid is very handy for computing these sorts of combinations, after that, dplyr functions can be used to handle the rest of the logic.

library(dplyr)

# create one data set for females
females <- df %>%
    filter(sex == "F") %>%
    select(f_id = id, f_time = time, f_group = group)

# create one data set for males
males <- df %>%
    filter(sex == "M") %>%
    select(m_id = id, m_time = time, m_group = group)

# All possible pairings of males and females
pairs <- expand.grid(f_id = females %>% pull(f_id),
                     m_id = males %>% pull(m_id),
                     stringsAsFactors = FALSE) 

# Merge in information about each individual
pairs <- pairs %>%
    left_join(females, by = "f_id") %>%
    left_join(males, by = "m_id") %>%
    # eliminate any pairings that are in different groups
    filter(f_group == m_group) 

pairs

Result, potential pairs

  f_id m_id f_time f_group m_time m_group
1    2    1   11.0       1   10.0       1
2    2    3   11.0       1   11.5       1
3    2    4   11.0       1   13.0       1
4    6    5   15.0       2   13.2       2
5    7    5    9.0       2   13.2       2
6    8    5    7.4       2   13.2       2
7   12    9   21.0       3   18.0       3
8   12   10   21.0       3   12.0       3
9   12   11   21.0       3   34.5       3


# compute distances and 
# subset for the closest male to each female
pairs %>% 
    mutate(diff = abs(m_time - f_time)) %>%
    
    group_by(f_id) %>%
    filter(diff == min(diff)) %>%
    select(m_id, f_id)
    

Output, the closest pairs

# A tibble: 5 x 2
# Groups:   f_id [5]
   m_id  f_id
  <dbl> <dbl>
1     3     2
2     5     6
3     5     7
4     5     8
5     9    12

Upvotes: 0

ThomasIsCoding
ThomasIsCoding

Reputation: 101247

Are you after something like below?

setDT(df)[
  ,
  c(
    .SD[sex == "F"],
    .(closestM_id = id[sex == "M"][max.col(-abs(outer(
      time[sex == "F"],
      time[sex == "M"], "-"
    )))])
  ), group
]

which gives

   group id sex time closestM_id
1:     1  2   F 11.0           3
2:     2  6   F 15.0           5
3:     2  7   F  9.0           5
4:     2  8   F  7.4           5
5:     3 12   F 21.0           9

Data

> dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
    group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), sex = c("M",
    "F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"), time = c(10,
    11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)), class = "data.frame", row.names = c(NA,
-12L))

Upvotes: 0

fabla
fabla

Reputation: 1816

You could split the data.frame() into groups of males and females then use outer() to find the absolute difference in time of all combinations.

Code:

lapply(split(df, df[, "group"]), function(x){
  # split by sex
  tmp1 <- split(x, x[, "sex"])
  
  # time difference for every combination
  tmp2 <- abs(t(outer(tmp1[["M"]][, "time"], tmp1[["F"]][, "time"], "-")))
  
  # find minimum for each woman (rowwise minimum)
  # and connect those numbers with original ID in input data.frame
  tmp3 <- tmp1[["M"]][apply(tmp2, 1, which.min), ]
  
  # ronames to represent female ID
  rownames(tmp3) <- tmp1[["F"]][, "id"]
  
  # return
  tmp3
})

# $`1`
#   id group sex time
# 2  3     1   M 11.5
#
# $`2`
#   id group sex time
# 6  5     2   M 13.2
# 7  5     2   M 13.2
# 8  5     2   M 13.2
#
# $`3`
#    id group sex time
# 12  9     3   M   18

Now each group has its own data.frame(). The rownames() represent the ID of the woman and the respective row of the man in the data.frame() with the smallest absolute difference in time.

Data

df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12),
                 group = c(1,1,1,1,2,2,2,2,3,3,3,3),
                 sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F'),
                 time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21))

Upvotes: 0

Related Questions