Vaibhav Singh
Vaibhav Singh

Reputation: 1209

Convert dataframe into long format one column in R

I have data as given in input section (dput below), need to convert to output with all values of two rows in one long column. I tried using transpose but cells were getting trimmed.

I don't want to hardcode since in future I might have data in 3 or 4 rows in a similar way.

P.S - I also tried pivot_longer but it didnt help

    structure(list(Header = c("Sat 12/3   \n358a-947a\n1017a-229p HRS   10.02", 
"Sat 12/10  \n559a-1106a\n1134a-227p HRS         8.00"), X = c("Sun 12/4   ", 
"Sun 12/11  "), X.1 = c("Mon 12/5   \n548a-1121a\n1149a-618p\n650p-845p HRS   13.95", 
"Mon 12/12  \n500a-1121a\n1151a-547p\n616p-830p HRS   14.53"), 
    X.2 = c("Tue 12/6   \n359a-1120a\n1150a-400p HRS   11.53", 
    "Tue 12/13  \n548a-1120a\n1148a-449p HRS   10.54"), X.3 = c("Wed 12/7   \n548a-1119a\n1149a-515p HRS   10.95", 
    "Wed 12/14  \n429a-1120a\n1150a-432p HRS   11.56"), X.4 = c("Thu 12/8   \n549a-1120a\n1149a-447p HRS   10.48", 
    "Thu 12/15  \n429a-1121a\n1152a-431p HRS   11.52"), X.5 = c("Fri 12/9   \n548a-1120a\n1148a-218p HRS         8.03", 
    "Fri 12/16  \n430a-1120a\n1150a-432p HRS   11.55")), class = "data.frame", row.names = c(NA, 
-2L))

My try (with a little help)

pivot_longer(df, cols = c(1:7)) %>%    
  select(value) %>% 
  mutate(value=str_replace(value,"HRS","")) %>% 
  separate(.,value,into=c("day","entry1","entry2","entry3"),sep="\n") %>% 
  separate(.,entry1,into=c("time_in1","time_out1"),sep="-") %>% 
  separate(.,entry2,into=c("time_in2","time_out2"),sep="-") %>% 
  separate(.,time_out2,into=c("time_out2","duration1"),remove = FALSE,sep="    ",extra = "merge") %>% 
  separate(.,entry3,into=c("time_in3","time_out3"),sep="-") %>% 
  separate(.,time_out3,into=c("time_out3","duration2"),remove = FALSE,sep="    ") %>%
  mutate(duration=coalesce(duration1,duration2)) %>% 
  select(day, duration, time_in1,time_out1,time_in2,time_out2,time_in3,time_out3) %>% 
  separate(.,day,into=c("date","day"),extra="merge") %>% 
  mutate(day=mdy(paste0(day,"2021")),
         duration=str_trim(duration))

Expected outputRequired Output

Upvotes: 2

Views: 487

Answers (1)

Greg
Greg

Reputation: 3326

Approach

The key was tidyr::separate_rows(), which not only separates the cell by "\n" but also splits the components into rows rather than columns.

Here, it is much better to split into rows than into columns. Suppose that most cells have 2 or 3 entries separated by "\n"; but there is a "rogue" cell, with an unusually large number (say 9) of entries, generated by someone who repeatedly clocked in and out throughout the day.

While splitting into columns would create arbitrarily many time_in* | time_out* columns, which remain empty (NA) in all rows except the "rogue"

   date  day        duration time_in1 time_out1 time_in2 time_out2 time_in3 time_out3 time_in4 time_out4 time_in5 time_out5 time_in6 time_out6 time_in7 time_out7 time_in8 time_out8 time_in9 time_out9
   <chr> <date>     <chr>    <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>
#  ...   ...        ...      ...      ...       ...      ...       ...      ...       NA       NA        NA       NA        NA       NA        NA       NA        NA       NA        NA       NA        

splitting into rows will maintain a tame (and stable) columnar structure

   date       day      duration time_in time_out
   <date>     <chr>       <dbl> <chr>   <chr>   
#  ...        ...      ...      ...     ...
#  ...        ...      ...      ...     ...
#  ...        ...      ...      ...     ...

without any "extraneous" columns (or rows).

Solution

Given your sample data df

df <- structure(list(Header = c("Sat 12/3   \n358a-947a\n1017a-229p HRS   10.02", "Sat 12/10  \n559a-1106a\n1134a-227p HRS         8.00"),
                     X = c("Sun 12/4   ", "Sun 12/11  "),
                     X.1 = c("Mon 12/5   \n548a-1121a\n1149a-618p\n650p-845p HRS   13.95", "Mon 12/12  \n500a-1121a\n1151a-547p\n616p-830p HRS   14.53"), 
                     X.2 = c("Tue 12/6   \n359a-1120a\n1150a-400p HRS   11.53", "Tue 12/13  \n548a-1120a\n1148a-449p HRS   10.54"),
                     X.3 = c("Wed 12/7   \n548a-1119a\n1149a-515p HRS   10.95", "Wed 12/14  \n429a-1120a\n1150a-432p HRS   11.56"),
                     X.4 = c("Thu 12/8   \n549a-1120a\n1149a-447p HRS   10.48", "Thu 12/15  \n429a-1121a\n1152a-431p HRS   11.52"),
                     X.5 = c("Fri 12/9   \n548a-1120a\n1148a-218p HRS         8.03", "Fri 12/16  \n430a-1120a\n1150a-432p HRS   11.55")),
                class = "data.frame", row.names = c(NA, -2L))

the following workflow

library(tidyverse)
library(stringr)


# ...
# Code to generate 'df'.
# ...


year_observed <- 2016


results <- df %>%
  mutate(id = row_number()) %>%
  pivot_longer(!id, names_to = "column") %>%
  separate(value, into = c("date", "entries"), sep = "\n", fill = "right", extra = "merge", remove = TRUE) %>%
  separate(entries, into = c("times", "duration"), sep = "HRS", fill = "right", extra = "warn", remove = TRUE) %>%
  mutate(across(date:duration, trimws),
         date = as.Date(paste(str_extract(date, "\\d{1,2}/\\d{1,2}$"), year_observed, sep = "/"), format = "%m/%d/%Y"),
         duration = as.numeric(duration),
         duration = if_else(is.na(duration), 0, duration),
         day = format(date, format = "%a")) %>%
  separate_rows(times, sep = "\n") %>%
  separate(times, into = c("time_in", "time_out"), sep = "-", fill = "warn", extra = "warn", remove = TRUE) %>%
  # ...Further Transformations... %>%
  select(id, date, day, duration, time_in, time_out)


# View results.
results

should yield results like

# A tibble: 28 x 6
      id date       day   duration time_in time_out
   <int> <date>     <chr>    <dbl> <chr>   <chr>   
 1     1 2016-12-03 Sat       10.0 358a    947a    
 2     1 2016-12-03 Sat       10.0 1017a   229p    
 3     1 2016-12-04 Sun        0   NA      NA      
 4     1 2016-12-05 Mon       14.0 548a    1121a   
 5     1 2016-12-05 Mon       14.0 1149a   618p    
 6     1 2016-12-05 Mon       14.0 650p    845p    
 7     1 2016-12-06 Tue       11.5 359a    1120a   
 8     1 2016-12-06 Tue       11.5 1150a   400p    
 9     1 2016-12-07 Wed       11.0 548a    1119a   
10     1 2016-12-07 Wed       11.0 1149a   515p    
# ... with 18 more rows

where id identifies (by row number) the original record in df.

To pivot into your newly specified output, simply execute this code, or append it to the existing workflow:

wide_results <- results %>%
  group_by(id, date) %>% mutate(entry = row_number()) %>% ungroup() %>%
  pivot_wider(id_cols = c(date, day, duration), names_from = entry, names_glue = "{.value}_{entry}", values_from = c(time_in, time_out)) %>%
  # Select so as to alternate between 'time_in_*' and 'time_out_*'.
  select(order(as.numeric(str_extract(colnames(.), "\\d+$")), str_extract(colnames(.), "^time_(in|out)"), na.last = FALSE))


# View results.
wide_results

You should obtain wide_results like:

# A tibble: 14 x 9
   date       day   duration time_in_1 time_out_1 time_in_2 time_out_2 time_in_3 time_out_3
   <date>     <chr>    <dbl> <chr>     <chr>      <chr>     <chr>      <chr>     <chr>     
 1 2016-12-03 Sat      10.0  358a      947a       1017a     229p       NA        NA        
 2 2016-12-04 Sun       0    NA        NA         NA        NA         NA        NA        
 3 2016-12-05 Mon      14.0  548a      1121a      1149a     618p       650p      845p      
 4 2016-12-06 Tue      11.5  359a      1120a      1150a     400p       NA        NA        
 5 2016-12-07 Wed      11.0  548a      1119a      1149a     515p       NA        NA        
 6 2016-12-08 Thu      10.5  549a      1120a      1149a     447p       NA        NA        
 7 2016-12-09 Fri       8.03 548a      1120a      1148a     218p       NA        NA        
 8 2016-12-10 Sat       8    559a      1106a      1134a     227p       NA        NA        
 9 2016-12-11 Sun       0    NA        NA         NA        NA         NA        NA        
10 2016-12-12 Mon      14.5  500a      1121a      1151a     547p       616p      830p      
11 2016-12-13 Tue      10.5  548a      1120a      1148a     449p       NA        NA        
12 2016-12-14 Wed      11.6  429a      1120a      1150a     432p       NA        NA        
13 2016-12-15 Thu      11.5  429a      1121a      1152a     431p       NA        NA        
14 2016-12-16 Fri      11.6  430a      1120a      1150a     432p       NA        NA        

Note

You must supply the year_observed (here 2016) to correctly contextualize the dates written in m/d format. Otherwise, they will calibrate to the year 2021, which will skew the days of the week.

Warning

These dates (12/3, etc.) are in December, and close to the end of the calendar year. If any of these entries "cross over" (from 2016) into the next year (ex. 1/1/2017), they will be incorrectly calibrated to the former year (ex. 1/1/2016), and thus have an incorrect date and weekday.

However, if your dates do cross over, that's a good indication that the full date (12/3/2016) should have been notated in the original cells, in which case

results <- df %>%
  # ... %>%
  mutate(
    # ...
    date = as.Date(str_extract(date, "(\\d{1,2}/){2,2}\\d{4,4}$"), format = "%m/%d/%Y")
    # ...
  ) # ... %>%

would have sufficed to properly parse the dates.

Upvotes: 3

Related Questions