Reputation: 23
It's been two days since I'm trying to find this :
I have a dataframe with more than 2 mil observations with this structure
id = c(1,2,3,4,5,6,7,8,9,10,11,12)
group = c(1,1,1,1,2,2,2,2,3,3,3,3)
sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F')
time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)
I would like to find for each female the male with the closest time and this by group
By example here id = 2 is a female in the group 1 with time = 11 the closest male in the group 1 is id = 3
ect for each female in each group
I tried to use something like this
keep <- function(x){
a <- df[which.min(abs(df[which(df[,'sex'] == "M"),'time']-x[,'time'])),]
return(a)
}
apply(df, 1, keep)
But it does not work.
If someone can help me it would be great.
Upvotes: 2
Views: 970
Reputation: 27732
data.table
solution using a rolling join to nearest time.
Using the df
from Thomas' answer
setDT(df)
df[sex=="F",][,closestM_id := df[sex=="M",][df[sex=="F",],
x.id,
on = .(group, time), roll = "nearest"]]
# id group sex time closestM_id
# 1: 2 1 F 11.0 3
# 2: 6 2 F 15.0 5
# 3: 7 2 F 9.0 5
# 4: 8 2 F 7.4 5
# 5: 12 3 F 21.0 9
Upvotes: 1
Reputation: 1433
Restructuring the data would help. Create a separate data frame for each sex, create a third data set with all unique pairings of males and females, then merge and subset to narrow it down to the desired pairs. expand.grid
is very handy for computing these sorts of combinations, after that, dplyr
functions can be used to handle the rest of the logic.
library(dplyr)
# create one data set for females
females <- df %>%
filter(sex == "F") %>%
select(f_id = id, f_time = time, f_group = group)
# create one data set for males
males <- df %>%
filter(sex == "M") %>%
select(m_id = id, m_time = time, m_group = group)
# All possible pairings of males and females
pairs <- expand.grid(f_id = females %>% pull(f_id),
m_id = males %>% pull(m_id),
stringsAsFactors = FALSE)
# Merge in information about each individual
pairs <- pairs %>%
left_join(females, by = "f_id") %>%
left_join(males, by = "m_id") %>%
# eliminate any pairings that are in different groups
filter(f_group == m_group)
pairs
Result, potential pairs
f_id m_id f_time f_group m_time m_group
1 2 1 11.0 1 10.0 1
2 2 3 11.0 1 11.5 1
3 2 4 11.0 1 13.0 1
4 6 5 15.0 2 13.2 2
5 7 5 9.0 2 13.2 2
6 8 5 7.4 2 13.2 2
7 12 9 21.0 3 18.0 3
8 12 10 21.0 3 12.0 3
9 12 11 21.0 3 34.5 3
# compute distances and
# subset for the closest male to each female
pairs %>%
mutate(diff = abs(m_time - f_time)) %>%
group_by(f_id) %>%
filter(diff == min(diff)) %>%
select(m_id, f_id)
Output, the closest pairs
# A tibble: 5 x 2
# Groups: f_id [5]
m_id f_id
<dbl> <dbl>
1 3 2
2 5 6
3 5 7
4 5 8
5 9 12
Upvotes: 0
Reputation: 101247
Are you after something like below?
setDT(df)[
,
c(
.SD[sex == "F"],
.(closestM_id = id[sex == "M"][max.col(-abs(outer(
time[sex == "F"],
time[sex == "M"], "-"
)))])
), group
]
which gives
group id sex time closestM_id
1: 1 2 F 11.0 3
2: 2 6 F 15.0 5
3: 2 7 F 9.0 5
4: 2 8 F 7.4 5
5: 3 12 F 21.0 9
Data
> dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), sex = c("M",
"F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"), time = c(10,
11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)), class = "data.frame", row.names = c(NA,
-12L))
Upvotes: 0
Reputation: 1816
You could split the data.frame()
into groups of males and females then use outer()
to find the absolute difference in time of all combinations.
Code:
lapply(split(df, df[, "group"]), function(x){
# split by sex
tmp1 <- split(x, x[, "sex"])
# time difference for every combination
tmp2 <- abs(t(outer(tmp1[["M"]][, "time"], tmp1[["F"]][, "time"], "-")))
# find minimum for each woman (rowwise minimum)
# and connect those numbers with original ID in input data.frame
tmp3 <- tmp1[["M"]][apply(tmp2, 1, which.min), ]
# ronames to represent female ID
rownames(tmp3) <- tmp1[["F"]][, "id"]
# return
tmp3
})
# $`1`
# id group sex time
# 2 3 1 M 11.5
#
# $`2`
# id group sex time
# 6 5 2 M 13.2
# 7 5 2 M 13.2
# 8 5 2 M 13.2
#
# $`3`
# id group sex time
# 12 9 3 M 18
Now each group has its own data.frame()
. The rownames()
represent the ID
of the woman and the respective row of the man in the data.frame()
with the smallest absolute difference in time.
Data
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12),
group = c(1,1,1,1,2,2,2,2,3,3,3,3),
sex = c('M','F', 'M', 'M', 'M', 'F', 'F', 'F', 'M', 'M', 'M', 'F'),
time = c(10, 11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21))
Upvotes: 0