Reputation: 11
Hi, I just need help to create desired output from the sample data.
Any help is greatly appreciated.
Upvotes: 0
Views: 54
Reputation: 160407
As already mentioned in comments, this "violates" the premise of tidy data to a certain extent, though it is less a problem if you are doing this at report-generation time (where wider layouts may be more aesthetically functional).
Try this:
library(dplyr)
# library(tidyr) # pivot_wider
out <- dat %>%
# set up for determining groups of columns
group_by(ID) %>%
mutate(
n = row_number(),
IDn = paste(n, ID, sep = "_")
) %>%
ungroup() %>%
# pivot so that each group of columns is now in a wider format
tidyr::pivot_wider(ID, names_from = "n", values_from = c("IDn", "Date", "Score")) %>%
# convert from the solitary 'ID' field to the duplicated 'IDn' fields,
# and fix the augmented ID _values_ to remove the 'n' component
select(-ID) %>%
mutate(across(starts_with("IDn"), ~ sub(".*_", "", .))) %>%
# order them based on the trailing number, natural sort after that
select(colnames(.)[ order(sub(".*_", "", colnames(.))) ])
out
# # A tibble: 10 x 9
# IDn_1 Date_1 Score_1 IDn_2 Date_2 Score_2 IDn_3 Date_3 Score_3
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 101 44656 10 101 44660 50 101 44676 65
# 2 102 44656 30 102 44678 66 NA NA NA
# 3 103 44658 40 103 44673 32 NA NA NA
# 4 104 44659 20 104 44682 65 NA NA NA
# 5 105 44661 35 NA NA NA NA NA NA
# 6 106 44661 45 NA NA NA NA NA NA
# 7 107 44682 51 NA NA NA NA NA NA
# 8 108 44683 25 NA NA NA NA NA NA
# 9 109 44684 39 NA NA NA NA NA NA
# 10 110 44685 43 NA NA NA NA NA NA
The results have two things "wrong":
The ID
columns have an n
in there, mostly because of how I chose to duplicate them across each group of columns.
All columns have _#
appended. R in general prefers its column names to be unique. It can be coerced, but it makes any references to columns from that point on either ambiguous (if using column names) or requiring integer column indices (which can be fragile).
However, dplyr
and most of the tidyverse
in general more-strongly prefer against duplicate names, so much so that I couldn't find an elegant/generic select(..)
to do it in one step.
If you must have the names as in your question, it can be done outside of dplyr
with:
names(out) <- sub("n?_.*", "", names(out))
out
# # A tibble: 10 x 9
# ID Date Score ID Date Score ID Date Score
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 101 44656 10 101 44660 50 101 44676 65
# 2 102 44656 30 102 44678 66 NA NA NA
# 3 103 44658 40 103 44673 32 NA NA NA
# 4 104 44659 20 104 44682 65 NA NA NA
# 5 105 44661 35 NA NA NA NA NA NA
# 6 106 44661 45 NA NA NA NA NA NA
# 7 107 44682 51 NA NA NA NA NA NA
# 8 108 44683 25 NA NA NA NA NA NA
# 9 109 44684 39 NA NA NA NA NA NA
# 10 110 44685 43 NA NA NA NA NA NA
(Sure, it's possible to put that somehow in a dplyr pipe, but I don't know if that buys you much.)
Data (from the OP comments)
dat <- structure(list(ID = c("101", "102", "103", "104", "101", "105", "106", "103", "101", "102", "104", "107", "108", "109", "110" ), Date = c("44656", "44656", "44658", "44659", "44660", "44661", "44661", "44673", "44676", "44678", "44682", "44682", "44683", "44684", "44685"), Score = c("10", "30", "40", "20", "50", "35", "45", "32", "65", "66", "65", "51", "25", "39", "43")), row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 1