wooden05

Reputation: 195

Most efficient way of determining which ID does not have a pair?

Say that I have a dataframe that looks like the one below. It contains the following pairs of IDs: (4330, 4331), (2333, 2334), (3336, 3337), where the two IDs in each pair differ by 1. However, 3349 does not have a pair. What would be the most efficient way of filtering out unpaired IDs?

    ID sex zyg race SES
1 4330   2   2    2   1
2 4331   2   2    2   1
3 2333   2   2    1  78
4 2334   2   2    1  78
5 3336   2   2    1  18
6 3337   2   2    1  18
7 3349   2   2    1  18

Upvotes: 0

Views: 75

Answers (2)

jblood94

Reputation: 16981

This returns only pairs/twins (no unpaired individuals, and no triplets, quadruplets, etc.). In base R:

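# Example data: the IDs from the question plus a triplet (1, 2, 3)
df <- data.frame(ID = c(1:3, 4330, 4331, 2333, 2334, 3336, 3337, 3349), sex = 2)
# Sort so that members of the same group sit in adjacent rows
df <- df[order(df$ID),]
# Keep both rows of every run of exactly two consecutive IDs
df[
  rep(
    with(
      rle(diff(df$ID)),
      cumsum(lengths)[lengths == 1L & values == 1]
    ), each = 2
  ) + 0:1,
]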
#>     ID sex
#> 6 2333   2
#> 7 2334   2
#> 8 3336   2
#> 9 3337   2
#> 4 4330   2
#> 5 4331   2

Explanation:

After sorting the data, two adjacent rows have IDs that differ by exactly 1 only when both rows belong to the same group (twins, triplets, etc.). diff(df$ID) returns the difference in ID from each row to the next along the whole data.frame. To identify twins, we want to find where diff(df$ID) has a 1 that is by itself (i.e., neither the previous value nor the next value is also 1). We use rle to find those lone 1s:

rle(diff(df$ID))
#> Run Length Encoding
#>   lengths: int [1:8] 2 1 1 1 1 1 1 1
#>   values : num [1:8] 1 2330 1 1002 1 12 981 1

Lone 1s occur where the value of diff(df$ID) (values) and the length of the run of that value (lengths) are both 1. This is the case for the third, fifth, and eighth runs. cumsum(lengths) gives the position in diff(df$ID) at which each run ends; for a run of length 1 that is also where it starts, and position i of diff(df$ID) compares rows i and i + 1 of df. Subsetting cumsum(lengths) at runs 3, 5, and 8 therefore gives the first row of each twin pair in df (rows 4, 6, and 9 of the sorted data). We repeat each of those indices twice with rep(..., each = 2), then add 0:1 (taking advantage of recycling in R) to get the row indices of every individual who is a twin.
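
For anyone who finds the nested call hard to read, the same steps can be unrolled into named intermediates. This is only a sketch of the logic described above, using the same example data; the variable names (d, r, lone, starts, idx, pairs) are made up for illustration.

```r
# Same example data as above, sorted by ID
df <- data.frame(ID = c(1:3, 4330, 4331, 2333, 2334, 3336, 3337, 3349), sex = 2)
df <- df[order(df$ID), ]

d <- diff(df$ID)   # gap between each row's ID and the next row's ID
r <- rle(d)        # runs of identical gaps

# A lone 1: the gap is 1 and the run of that gap has length 1
lone <- r$lengths == 1L & r$values == 1

# cumsum(lengths) is the position in d where each run ends; for a run of
# length 1 that is also where it starts, i.e. the first row of the pair
starts <- cumsum(r$lengths)[lone]

# Each pair occupies rows start and start + 1 (0:1 is recycled)
idx <- rep(starts, each = 2) + 0:1

pairs <- df[idx, ]
```

Printing starts and idx at each step makes it easy to check the indexing against your own data before collapsing everything back into a single expression.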

Upvotes: 2

zephryl

Reputation: 17069

Using dplyr::lag() and lead(), you can filter() to rows where the previous ID is ID - 1 or the next ID is ID + 1. This assumes the two members of a pair sit in adjacent rows, as they do in the example data:

library(dplyr)

df %>% 
  filter(lag(ID) == ID - 1 | lead(ID) == ID + 1)
# A tibble: 6 × 5
     ID   sex   zyg  race   SES
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  4330     2     2     2     1
2  4331     2     2     2     1
3  2333     2     2     1    78
4  2334     2     2     1    78
5  3336     2     2     1    18
6  3337     2     2     1    18

Edit: this will not filter out "triplets", "quadruplets", etc., contrary to the additional requirement mentioned in the comments.
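
One possible way to also drop triplets and larger groups while staying in dplyr is to label runs of consecutive IDs and keep only the runs of exactly two rows. This is a rough sketch rather than a tested solution for the real data: it borrows the example data from the other answer (which includes the triplet 1, 2, 3), and the names run, adjacent, and pairs_only are made up for illustration.

```r
library(dplyr)

df <- data.frame(ID = c(1:3, 4330, 4331, 2333, 2334, 3336, 3337, 3349), sex = 2)

# The lag()/lead() filter keeps the triplet 1, 2, 3 along with the pairs
adjacent <- df %>%
  filter(lag(ID) == ID - 1 | lead(ID) == ID + 1)

pairs_only <- df %>%
  arrange(ID) %>%
  # start a new run whenever the gap to the previous ID is not 1
  group_by(run = cumsum(c(TRUE, diff(ID) != 1))) %>%
  # keep runs made of exactly two consecutive IDs
  filter(n() == 2) %>%
  ungroup() %>%
  select(-run)
```

Unlike the lag()/lead() version, this sorts by ID first, so the result comes back in ID order rather than in the original row order.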

Upvotes: 0
