spore234
spore234

Reputation: 3640

select repeated row observations with the least absolute difference

I have a data frame like this:

df <- data.frame(id = c(1,1,1,2,2,3,3,3,3),
                  vars = c(1,2,5, 1,3, 0,2,4,-1))

> df
  id vars
1  1    1
2  1    2
3  1    5
4  2    1
5  2    3
6  3    0
7  3    2
8  3    4
9  3   -1

In this data frame each id can have several observations. I now want to select for each id the pair (2 observations) that have the least absolute difference for vars.

In the above case that would be

  id vars
1  1    1
2  1    2
3  2    1
4  2    3
5  3    0
6  3   -1

for id 1, values 1 and 2 have the lowest absolute difference, id 2 only has 2 observations so both are automatically selected. for the id 3 the selected vars would be 0 and -1 because the absolute difference is 1, lower than all other combinations.

Upvotes: 4

Views: 125

Answers (2)

IceCreamToucan
IceCreamToucan

Reputation: 28695

You don't need to do all the comparisons (or, you can let arrange do your comparisons for you), because once you've sorted the values each value is already beside the value for which the difference is minimized.

df %>% 
  group_by(id) %>% 
  arrange(vars) %>% 
  slice(which.min(diff(vars)) + 0:1)

# # A tibble: 6 x 2
# # Groups:   id [3]
#      id  vars
#   <dbl> <dbl>
# 1     1     1
# 2     1     2
# 3     2     1
# 4     2     3
# 5     3    -1
# 6     3     0

data.table version

library(data.table)
setDT(df)

df[df[order(vars), .I[which.min(diff(vars)) + 0:1], id]$V1]

#    id vars
# 1:  3   -1
# 2:  3    0
# 3:  1    1
# 4:  1    2
# 5:  2    1
# 6:  2    3

Upvotes: 7

boski
boski

Reputation: 2467

Not the most concise but works. Probably somebody can improve the idea.

df1%>%group_by(id)%>%mutate(vars2=which.min(abs(diff(combn(num(vars),2)))))%>%
  mutate(vars1=ifelse(vars%in%combn(num(vars),2)[,vars2],vars,NA))%>%select(id,vars1)%>%.[complete.cases(.),]

# A tibble: 6 x 2
# Groups:   id [3]
     id vars1
  <dbl> <dbl>
1     1     1
2     1     2
3     2     1
4     2     3
5     3     0
6     3    -1

The main idea is to do the difference on all the possible combinations of the values of each group. vars2 keeps the column with the lowest difference. If the value is one of the two present in the vars2 column, it is kept. Else, it is set as NA. Then, only complete cases are returned.

Upvotes: 1

Related Questions