user13815308
user13815308

Reputation: 11

selecting duplicates by dates

enter image description here

Hi, I just need help to create desired output from the sample data.

Any help is greatly appreciated.

Upvotes: 0

Views: 54

Answers (1)

r2evans
r2evans

Reputation: 160407

As already mentioned in comments, this "violates" the premise of tidy data to a certain extent, though it is less a problem if you are doing this at report-generation time (where wider layouts may be more aesthetically functional).

Try this:

library(dplyr)
# library(tidyr) # pivot_wider
out <- dat %>%
  # set up for determining groups of columns
  group_by(ID) %>%
  mutate(
    n = row_number(),
    IDn = paste(n, ID, sep = "_")
  ) %>%
  ungroup() %>%
  # pivot so that each group of columns is now in a wider format
  tidyr::pivot_wider(ID, names_from = "n", values_from = c("IDn", "Date", "Score")) %>%
  # convert from the solitary 'ID' field to the duplicated 'IDn' fields,
  # and fix the augmented ID _values_ to remove the 'n' component
  select(-ID) %>%
  mutate(across(starts_with("IDn"), ~ sub(".*_", "", .))) %>%
  # order them based on the trailing number, natural sort after that
  select(colnames(.)[ order(sub(".*_", "", colnames(.))) ])
out
# # A tibble: 10 x 9
#    IDn_1 Date_1 Score_1 IDn_2 Date_2 Score_2 IDn_3 Date_3 Score_3
#    <chr> <chr>  <chr>   <chr> <chr>  <chr>   <chr> <chr>  <chr>  
#  1 101   44656  10      101   44660  50      101   44676  65     
#  2 102   44656  30      102   44678  66      NA    NA     NA     
#  3 103   44658  40      103   44673  32      NA    NA     NA     
#  4 104   44659  20      104   44682  65      NA    NA     NA     
#  5 105   44661  35      NA    NA     NA      NA    NA     NA     
#  6 106   44661  45      NA    NA     NA      NA    NA     NA     
#  7 107   44682  51      NA    NA     NA      NA    NA     NA     
#  8 108   44683  25      NA    NA     NA      NA    NA     NA     
#  9 109   44684  39      NA    NA     NA      NA    NA     NA     
# 10 110   44685  43      NA    NA     NA      NA    NA     NA     

The results have two things "wrong":

  • The ID columns have an n in there, mostly because of how I chose to duplicate them across each group of columns.

  • All columns have _# appended. R in general prefers its column names to be unique. It can be coerced, but it makes any references to columns from that point on either ambiguous (if using column names) or requiring integer column indices (which can be fragile).

  • However, dplyr and most of the tidyverse in general more-strongly prefer against duplicate names, so much so that I couldn't find an elegant/generic select(..) to do it in one step.

If you must have the names as in your question, it can be done outside of dplyr with:

names(out) <- sub("n?_.*", "", names(out))
out
# # A tibble: 10 x 9
#    ID    Date  Score ID    Date  Score ID    Date  Score
#    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#  1 101   44656 10    101   44660 50    101   44676 65   
#  2 102   44656 30    102   44678 66    NA    NA    NA   
#  3 103   44658 40    103   44673 32    NA    NA    NA   
#  4 104   44659 20    104   44682 65    NA    NA    NA   
#  5 105   44661 35    NA    NA    NA    NA    NA    NA   
#  6 106   44661 45    NA    NA    NA    NA    NA    NA   
#  7 107   44682 51    NA    NA    NA    NA    NA    NA   
#  8 108   44683 25    NA    NA    NA    NA    NA    NA   
#  9 109   44684 39    NA    NA    NA    NA    NA    NA   
# 10 110   44685 43    NA    NA    NA    NA    NA    NA   

(Sure, it's possible to put that somehow in a dplyr pipe, but I don't know if that buys you much.)


Data (from the OP comments)

dat <- structure(list(ID = c("101", "102", "103", "104", "101", "105", "106", "103", "101", "102", "104", "107", "108", "109", "110" ), Date = c("44656", "44656", "44658", "44659", "44660", "44661", "44661", "44673", "44676", "44678", "44682", "44682", "44683", "44684", "44685"), Score = c("10", "30", "40", "20", "50", "35", "45", "32", "65", "66", "65", "51", "25", "39", "43")), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 1

Related Questions