pandora
pandora

Reputation: 59

Advice on Logical Data Tidying for a work project

My data set comes from an excel file set up by non-data orientated co-workers. My data is sensitive, so I cannot share the data set, but I will try to make an example so it's easier to see what I'm talking about. I have a unique identifier column. My problem is with the date columns. I have multiple date columns. When imported into R, some of the columns imported as dates, and some of them imported as excel dates. then some of the cells in the columns have a string of dates (in character type). not all the strings are the same size. Some have a list of 2, some a list 7.

I'm trying to Tidy my data. My thoughts were to put all the collection dates in one column. Here is where I'm stuck. I can use pivot_longer() bc not all the columns are the same type. I can't convert the excel dates into R dates w/out getting rid of the list of strings. And I can' get rid of the list of strings, bc I'm running into the Error: Incompatible lengths. I think my logic flow is wrong and any advice to point me in the right direction would be helpful. I can look up coding after i get my flow right.

What I have:

  Unique.ID Collection.Date1 Test.Result1 Collection.Date2 Test.Result2        Collection.Date3
1         1       12/12/2020     positive           1/1/21     negative                   44890
2         2         11/30/20     negative   1/8/21,1/20/21     negative 2/10/21,3/10/21,4/10/21
3         3          1/20/21     negative            44011     positive                   44007
4         4          12/1/20     positive            44018     negative                   44064

What I would like to have:

Unique ID     Collection Date     Test Result

1             12/12/20            positive

1             1/1/21              negative

1             2/2/21              NA

2             11/30/20            negative

2             1/8/21              negative

2             1//20/21            negative

2             2/10/21             NA

2             3/10/21             NA

2             4/10/21             NA

3             1/20/21             negative

3             3/3/21              positive

3             4/3/21              NA

4             12/1/20             positive

4             3/3/21              negative

4             4/3/21              NA

If I try to make everything into dates first, it runs into an error bc of the strings. When I try to expand the strings, since they are different lengths, I get an error doing that.

Upvotes: 1

Views: 216

Answers (1)

Martin Gal
Martin Gal

Reputation: 16978

Since you didn't post real data, we have to make some assumptions on the structure of your dataset.

Data

Assuming, your data looks like

structure(list(Unique.ID = c(1, 2, 3, 4), Collection.Date1 = c("12/12/2020", 
"11/30/20", "1/20/21", "12/1/20"), Test.Result1 = c("positive", 
"negative", "negative", "positive"), Collection.Date2 = c("1/1/21", 
"1/8/21,1/20/21", "44011", "44018"), Test.Result2 = c("negative", 
"negative", "positive", "negative"), Collection.Date3 = c("44890", 
"2/10/21,3/10/21,4/10/21", "44007", "44064")), problems = structure(list(
    row = c(1L, 4L), col = c(NA_character_, NA_character_), expected = c("6 columns", 
    "6 columns"), actual = c("7 columns", "7 columns"), file = c("literal data", 
    "literal data")), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -4L), spec = structure(list(
    cols = list(Unique.ID = structure(list(), class = c("collector_double", 
    "collector")), Collection.Date1 = structure(list(), class = c("collector_character", 
    "collector")), Test.Result1 = structure(list(), class = c("collector_character", 
    "collector")), Collection.Date2 = structure(list(), class = c("collector_character", 
    "collector")), Test.Result2 = structure(list(), class = c("collector_character", 
    "collector")), Collection.Date3 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 2L), class = "col_spec"))

or for better readability

# A tibble: 4 x 6
  Unique.ID Collection.Date1 Test.Result1 Collection.Date2 Test.Result2 Collection.Date3       
      <dbl> <chr>            <chr>        <chr>            <chr>        <chr>                  
1         1 12/12/2020       positive     1/1/21           negative     44890                  
2         2 11/30/20         negative     1/8/21,1/20/21   negative     2/10/21,3/10/21,4/10/21
3         3 1/20/21          negative     44011            positive     44007                  
4         4 12/1/20          positive     44018            negative     44064 

Note: This data already doesn't look all too bad. I really can think of much worse data. If your data has some more issues, the following workflow might fail.

Workflow

I prefer using packages included in tidyverse, in this case dplyr, tidyr and stringr.

Based on your data I would perform the following steps:

  1. Bringing the data into a long format.
  2. Seperate the dates, that are in one cell, namely turning "2/10/21, 3/10/21, 4/10/21" into seperate rows.
  3. Bringing the data into a wide format, turning dates and results into a column each.
  4. Tidy up your dates, especially taking care of the Excel-style dates.

Turned into code this looks like

library(tidyr)
library(dplyr)
library(stringr)

df %>% 
  pivot_longer(-Unique.ID, names_pattern="(.+)(\\d)", names_to=c("name", "no") ) %>%
  mutate(value = str_split(value, ",\\s*")) %>%
  unnest(value) %>%
  pivot_wider(values_fn=list) %>%
  unnest(c(Collection.Date, Test.Result)) %>%
  mutate(Collection.Date = coalesce(as.Date(Collection.Date, "%m/%d/%y"),
                                    as.Date(as.integer(Collection.Date), origin="1900-01-01"))) %>%
  suppressWarnings() %>%
  select(-no)

which returns

# A tibble: 15 x 3
   Unique.ID Collection.Date Test.Result
       <dbl> <date>          <chr>      
 1         1 2020-12-12      positive   
 2         1 2021-01-01      negative   
 3         1 2022-11-27      NA         
 4         2 2020-11-30      negative   
 5         2 2021-01-08      negative   
 6         2 2021-01-20      negative   
 7         2 2021-02-10      NA         
 8         2 2021-03-10      NA         
 9         2 2021-04-10      NA         
10         3 2021-01-20      negative   
11         3 2020-07-01      positive   
12         3 2020-06-27      NA         
13         4 2020-12-01      positive   
14         4 2020-07-08      negative   
15         4 2020-08-23      NA 

Upvotes: 1

Related Questions