Reputation: 167
this data has two columns. One with participant IDs and second with dates. I want to convert the dates to columns and within each column I want exact date of that ID. e.g. t1 march to may 2020, t2 June to sep, t3 oct to dec 2020, t4 jan to feb 2021, t5 march to June 2021. This is a dummy code as in my real data, I have ID as character variable rather than numerical eg CASK-01. Say CASK-01 is repeated six times with different dates, so I want that t1 column has that particular which falls into its range and so on.
n <- 6 #in actual data frequencies range from 2-8 or IDs are repeated 2-8 times rather than a fixed number
df <- data.frame(id=rep(1:500, each=n),
date=sample(seq(as.Date('2020/03/01'), as.Date('2021/07/30'), by="day"), 12))
Here is the structure of my real data:
str(df)
tibble [4,347 × 2] (S3: tbl_df/tbl/data.frame)
$ id : chr [1:4347] "CI-001" "CI-002" "CI-003" "CI-004" ...
$ sampledate: POSIXct[1:4347], format: "2020-03-06" "2020-03-06" "2020-03-06" "2020-03-06" ...
I tried to find relevant code but could not find it so far. I could do it manually in excel, in fact I already did that by sorting and then copy-pasting exact date there but I want to do that in R. What do you think? I have posted a dummy result picture as well. Basically first column will be participant id and next five columns are for time, i.e. t1-5 with date ranges. t1 march to may 2020, t2 June to sep, t3 oct to dec 2020, t4 jan to feb 2021, t5 march to June 2021. So dates linked to each participant ID will go into respective time column, t1-5. E.g., if cas-03 has dates: 5 feb 2021 (it’ll show up in t4 column), and 7 and 9 March 2020 (both in t1 but in two row). At the moment each ID is repeated as rows for new linked-date but new table will have IDs repeated in rows only if there’re multiple linked dates in a time column date range. Is it too complicated to do in R. What do think?
Recently I tried this code but it said: Error in filter(., id <= 3) : object 'id' not found In addition: Warning message: In data.matrix(data) : NAs introduced by coercion
# change to your df to have different rep per IDs
df <- df %>%
filter(id <= 3) %>%
mutate(id = ifelse(id == 3, 2, id))
df <- df %>%
group_by(id) %>%
mutate(t = paste0("t", row_number())) %>%
arrange(t) %>%
ungroup()
df %>%
pivot_wider(names_from = t, values_from = date) %>%
head()
Upvotes: 0
Views: 247
Reputation: 794
Well assuming you have exactly the same number of observations per ID (which is the case in your example since df %>% group_by(id) %>% summarise(count = n()) %>% pull(count) %>% unique()
= 6) I would advise you to use the tidyr
package as follow:
library(tidyr)
n <- 6
df <- data.frame(id=rep(1:500, each=n),
date=sample(seq(as.Date('2020/03/01'), as.Date('2021/07/30'), by="day"), 12))
df %>% group_by(id) %>% summarise(count = n()) %>% pull(count) %>% unique()
df <- df %>%
group_by(id) %>%
mutate(t = paste0("t", row_number())) %>%
arrange(t)
df %>%
pivot_wider(names_from = t, values_from = date)
Have a look here: https://tidyr.tidyverse.org/articles/pivot.html
EDIT to your comment
If you do not have the same number of repetition per IDs, that's no big deal. pivot_wider()
is smart enought and will add NA values in the output, example:
# change to your df to have different rep per IDs
df <- df %>%
filter(id <= 3) %>%
mutate(id = ifelse(id == 3, 2, id))
df <- df %>%
group_by(id) %>%
mutate(t = paste0("t", row_number())) %>%
arrange(t) %>%
ungroup()
df %>%
pivot_wider(names_from = t, values_from = date) %>%
head()
will return
# A tibble: 2 x 13
id t1 t10 t11 t12 t2 t3 t4 t5 t6 t7 t8
<dbl> <date> <date> <date> <date> <date> <date> <date> <date> <date> <date> <date>
1 1 2021-04-06 NA NA NA 2020-10-16 2020-03-17 2021-03-20 2020-11-04 2021-03-22 NA NA
2 2 2021-07-26 2021-03-20 2020-11-04 2021-03-22 2021-07-02 2020-03-09 2020-05-13 2021-04-23 2020-04-09 2021-04-06 2020-10-16
# ... with 1 more variable: t9 <date>
Upvotes: 0