Reputation: 59
My data set comes from an excel file set up by non-data orientated co-workers. My data is sensitive, so I cannot share the data set, but I will try to make an example so it's easier to see what I'm talking about. I have a unique identifier column. My problem is with the date columns. I have multiple date columns. When imported into R, some of the columns imported as dates, and some of them imported as excel dates. then some of the cells in the columns have a string of dates (in character type). not all the strings are the same size. Some have a list of 2, some a list 7.
I'm trying to Tidy my data. My thoughts were to put all the collection dates in one column. Here is where I'm stuck. I can use pivot_longer() bc not all the columns are the same type. I can't convert the excel dates into R dates w/out getting rid of the list of strings. And I can' get rid of the list of strings, bc I'm running into the Error: Incompatible lengths. I think my logic flow is wrong and any advice to point me in the right direction would be helpful. I can look up coding after i get my flow right.
What I have:
Unique.ID Collection.Date1 Test.Result1 Collection.Date2 Test.Result2 Collection.Date3
1 1 12/12/2020 positive 1/1/21 negative 44890
2 2 11/30/20 negative 1/8/21,1/20/21 negative 2/10/21,3/10/21,4/10/21
3 3 1/20/21 negative 44011 positive 44007
4 4 12/1/20 positive 44018 negative 44064
What I would like to have:
Unique ID Collection Date Test Result
1 12/12/20 positive
1 1/1/21 negative
1 2/2/21 NA
2 11/30/20 negative
2 1/8/21 negative
2 1//20/21 negative
2 2/10/21 NA
2 3/10/21 NA
2 4/10/21 NA
3 1/20/21 negative
3 3/3/21 positive
3 4/3/21 NA
4 12/1/20 positive
4 3/3/21 negative
4 4/3/21 NA
If I try to make everything into dates first, it runs into an error bc of the strings. When I try to expand the strings, since they are different lengths, I get an error doing that.
Upvotes: 1
Views: 216
Reputation: 16978
Since you didn't post real data, we have to make some assumptions on the structure of your dataset.
Assuming, your data looks like
structure(list(Unique.ID = c(1, 2, 3, 4), Collection.Date1 = c("12/12/2020",
"11/30/20", "1/20/21", "12/1/20"), Test.Result1 = c("positive",
"negative", "negative", "positive"), Collection.Date2 = c("1/1/21",
"1/8/21,1/20/21", "44011", "44018"), Test.Result2 = c("negative",
"negative", "positive", "negative"), Collection.Date3 = c("44890",
"2/10/21,3/10/21,4/10/21", "44007", "44064")), problems = structure(list(
row = c(1L, 4L), col = c(NA_character_, NA_character_), expected = c("6 columns",
"6 columns"), actual = c("7 columns", "7 columns"), file = c("literal data",
"literal data")), row.names = c(NA, -2L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(Unique.ID = structure(list(), class = c("collector_double",
"collector")), Collection.Date1 = structure(list(), class = c("collector_character",
"collector")), Test.Result1 = structure(list(), class = c("collector_character",
"collector")), Collection.Date2 = structure(list(), class = c("collector_character",
"collector")), Test.Result2 = structure(list(), class = c("collector_character",
"collector")), Collection.Date3 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 2L), class = "col_spec"))
or for better readability
# A tibble: 4 x 6
Unique.ID Collection.Date1 Test.Result1 Collection.Date2 Test.Result2 Collection.Date3
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1 12/12/2020 positive 1/1/21 negative 44890
2 2 11/30/20 negative 1/8/21,1/20/21 negative 2/10/21,3/10/21,4/10/21
3 3 1/20/21 negative 44011 positive 44007
4 4 12/1/20 positive 44018 negative 44064
Note: This data already doesn't look all too bad. I really can think of much worse data. If your data has some more issues, the following workflow might fail.
I prefer using packages included in tidyverse
, in this case dplyr
, tidyr
and stringr
.
Based on your data I would perform the following steps:
Turned into code this looks like
library(tidyr)
library(dplyr)
library(stringr)
df %>%
pivot_longer(-Unique.ID, names_pattern="(.+)(\\d)", names_to=c("name", "no") ) %>%
mutate(value = str_split(value, ",\\s*")) %>%
unnest(value) %>%
pivot_wider(values_fn=list) %>%
unnest(c(Collection.Date, Test.Result)) %>%
mutate(Collection.Date = coalesce(as.Date(Collection.Date, "%m/%d/%y"),
as.Date(as.integer(Collection.Date), origin="1900-01-01"))) %>%
suppressWarnings() %>%
select(-no)
which returns
# A tibble: 15 x 3
Unique.ID Collection.Date Test.Result
<dbl> <date> <chr>
1 1 2020-12-12 positive
2 1 2021-01-01 negative
3 1 2022-11-27 NA
4 2 2020-11-30 negative
5 2 2021-01-08 negative
6 2 2021-01-20 negative
7 2 2021-02-10 NA
8 2 2021-03-10 NA
9 2 2021-04-10 NA
10 3 2021-01-20 negative
11 3 2020-07-01 positive
12 3 2020-06-27 NA
13 4 2020-12-01 positive
14 4 2020-07-08 negative
15 4 2020-08-23 NA
Upvotes: 1