Reputation: 1209
I have data as given in input section (dput below), need to convert to output with all values of two rows in one long column. I tried using transpose but cells were getting trimmed.
I don't want to hardcode since in future I might have data in 3 or 4 rows in a similar way.
P.S - I also tried pivot_longer but it didnt help
structure(list(Header = c("Sat 12/3 \n358a-947a\n1017a-229p HRS 10.02",
"Sat 12/10 \n559a-1106a\n1134a-227p HRS 8.00"), X = c("Sun 12/4 ",
"Sun 12/11 "), X.1 = c("Mon 12/5 \n548a-1121a\n1149a-618p\n650p-845p HRS 13.95",
"Mon 12/12 \n500a-1121a\n1151a-547p\n616p-830p HRS 14.53"),
X.2 = c("Tue 12/6 \n359a-1120a\n1150a-400p HRS 11.53",
"Tue 12/13 \n548a-1120a\n1148a-449p HRS 10.54"), X.3 = c("Wed 12/7 \n548a-1119a\n1149a-515p HRS 10.95",
"Wed 12/14 \n429a-1120a\n1150a-432p HRS 11.56"), X.4 = c("Thu 12/8 \n549a-1120a\n1149a-447p HRS 10.48",
"Thu 12/15 \n429a-1121a\n1152a-431p HRS 11.52"), X.5 = c("Fri 12/9 \n548a-1120a\n1148a-218p HRS 8.03",
"Fri 12/16 \n430a-1120a\n1150a-432p HRS 11.55")), class = "data.frame", row.names = c(NA,
-2L))
My try (with a little help)
pivot_longer(df, cols = c(1:7)) %>%
select(value) %>%
mutate(value=str_replace(value,"HRS","")) %>%
separate(.,value,into=c("day","entry1","entry2","entry3"),sep="\n") %>%
separate(.,entry1,into=c("time_in1","time_out1"),sep="-") %>%
separate(.,entry2,into=c("time_in2","time_out2"),sep="-") %>%
separate(.,time_out2,into=c("time_out2","duration1"),remove = FALSE,sep=" ",extra = "merge") %>%
separate(.,entry3,into=c("time_in3","time_out3"),sep="-") %>%
separate(.,time_out3,into=c("time_out3","duration2"),remove = FALSE,sep=" ") %>%
mutate(duration=coalesce(duration1,duration2)) %>%
select(day, duration, time_in1,time_out1,time_in2,time_out2,time_in3,time_out3) %>%
separate(.,day,into=c("date","day"),extra="merge") %>%
mutate(day=mdy(paste0(day,"2021")),
duration=str_trim(duration))
Upvotes: 2
Views: 487
Reputation: 3326
The key was tidyr::separate_rows()
, which not only separates the cell by "\n"
but also splits the components into rows rather than columns.
Here, it is much better to split into rows than into columns. Suppose that most cells have 2 or 3 entries separated by "\n"
; but there is a "rogue" cell, with an unusually large number (say 9) of entries, generated by someone who repeatedly clocked in and out throughout the day.
While splitting into columns would create arbitrarily many time_in* | time_out*
columns, which remain empty (NA
) in all rows except the "rogue"
date day duration time_in1 time_out1 time_in2 time_out2 time_in3 time_out3 time_in4 time_out4 time_in5 time_out5 time_in6 time_out6 time_in7 time_out7 time_in8 time_out8 time_in9 time_out9
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# ... ... ... ... ... ... ... ... ... NA NA NA NA NA NA NA NA NA NA NA NA
splitting into rows will maintain a tame (and stable) columnar structure
date day duration time_in time_out
<date> <chr> <dbl> <chr> <chr>
# ... ... ... ... ...
# ... ... ... ... ...
# ... ... ... ... ...
without any "extraneous" columns (or rows).
Given your sample data df
df <- structure(list(Header = c("Sat 12/3 \n358a-947a\n1017a-229p HRS 10.02", "Sat 12/10 \n559a-1106a\n1134a-227p HRS 8.00"),
X = c("Sun 12/4 ", "Sun 12/11 "),
X.1 = c("Mon 12/5 \n548a-1121a\n1149a-618p\n650p-845p HRS 13.95", "Mon 12/12 \n500a-1121a\n1151a-547p\n616p-830p HRS 14.53"),
X.2 = c("Tue 12/6 \n359a-1120a\n1150a-400p HRS 11.53", "Tue 12/13 \n548a-1120a\n1148a-449p HRS 10.54"),
X.3 = c("Wed 12/7 \n548a-1119a\n1149a-515p HRS 10.95", "Wed 12/14 \n429a-1120a\n1150a-432p HRS 11.56"),
X.4 = c("Thu 12/8 \n549a-1120a\n1149a-447p HRS 10.48", "Thu 12/15 \n429a-1121a\n1152a-431p HRS 11.52"),
X.5 = c("Fri 12/9 \n548a-1120a\n1148a-218p HRS 8.03", "Fri 12/16 \n430a-1120a\n1150a-432p HRS 11.55")),
class = "data.frame", row.names = c(NA, -2L))
the following workflow
library(tidyverse)
library(stringr)
# ...
# Code to generate 'df'.
# ...
year_observed <- 2016
results <- df %>%
mutate(id = row_number()) %>%
pivot_longer(!id, names_to = "column") %>%
separate(value, into = c("date", "entries"), sep = "\n", fill = "right", extra = "merge", remove = TRUE) %>%
separate(entries, into = c("times", "duration"), sep = "HRS", fill = "right", extra = "warn", remove = TRUE) %>%
mutate(across(date:duration, trimws),
date = as.Date(paste(str_extract(date, "\\d{1,2}/\\d{1,2}$"), year_observed, sep = "/"), format = "%m/%d/%Y"),
duration = as.numeric(duration),
duration = if_else(is.na(duration), 0, duration),
day = format(date, format = "%a")) %>%
separate_rows(times, sep = "\n") %>%
separate(times, into = c("time_in", "time_out"), sep = "-", fill = "warn", extra = "warn", remove = TRUE) %>%
# ...Further Transformations... %>%
select(id, date, day, duration, time_in, time_out)
# View results.
results
should yield results
like
# A tibble: 28 x 6
id date day duration time_in time_out
<int> <date> <chr> <dbl> <chr> <chr>
1 1 2016-12-03 Sat 10.0 358a 947a
2 1 2016-12-03 Sat 10.0 1017a 229p
3 1 2016-12-04 Sun 0 NA NA
4 1 2016-12-05 Mon 14.0 548a 1121a
5 1 2016-12-05 Mon 14.0 1149a 618p
6 1 2016-12-05 Mon 14.0 650p 845p
7 1 2016-12-06 Tue 11.5 359a 1120a
8 1 2016-12-06 Tue 11.5 1150a 400p
9 1 2016-12-07 Wed 11.0 548a 1119a
10 1 2016-12-07 Wed 11.0 1149a 515p
# ... with 18 more rows
where id
identifies (by row number) the original record in df
.
To pivot into your newly specified output, simply execute this code, or append it to the existing workflow:
wide_results <- results %>%
group_by(id, date) %>% mutate(entry = row_number()) %>% ungroup() %>%
pivot_wider(id_cols = c(date, day, duration), names_from = entry, names_glue = "{.value}_{entry}", values_from = c(time_in, time_out)) %>%
# Select so as to alternate between 'time_in_*' and 'time_out_*'.
select(order(as.numeric(str_extract(colnames(.), "\\d+$")), str_extract(colnames(.), "^time_(in|out)"), na.last = FALSE))
# View results.
wide_results
You should obtain wide_results
like:
# A tibble: 14 x 9
date day duration time_in_1 time_out_1 time_in_2 time_out_2 time_in_3 time_out_3
<date> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 2016-12-03 Sat 10.0 358a 947a 1017a 229p NA NA
2 2016-12-04 Sun 0 NA NA NA NA NA NA
3 2016-12-05 Mon 14.0 548a 1121a 1149a 618p 650p 845p
4 2016-12-06 Tue 11.5 359a 1120a 1150a 400p NA NA
5 2016-12-07 Wed 11.0 548a 1119a 1149a 515p NA NA
6 2016-12-08 Thu 10.5 549a 1120a 1149a 447p NA NA
7 2016-12-09 Fri 8.03 548a 1120a 1148a 218p NA NA
8 2016-12-10 Sat 8 559a 1106a 1134a 227p NA NA
9 2016-12-11 Sun 0 NA NA NA NA NA NA
10 2016-12-12 Mon 14.5 500a 1121a 1151a 547p 616p 830p
11 2016-12-13 Tue 10.5 548a 1120a 1148a 449p NA NA
12 2016-12-14 Wed 11.6 429a 1120a 1150a 432p NA NA
13 2016-12-15 Thu 11.5 429a 1121a 1152a 431p NA NA
14 2016-12-16 Fri 11.6 430a 1120a 1150a 432p NA NA
You must supply the year_observed
(here 2016
) to correctly contextualize the dates written in m/d
format. Otherwise, they will calibrate to the year 2021
, which will skew the day
s of the week.
These dates (12/3
, etc.) are in December, and close to the end of the calendar year. If any of these entries "cross over" (from 2016
) into the next year (ex. 1/1/2017
), they will be incorrectly calibrated to the former year (ex. 1/1/2016
), and thus have an incorrect date
and weekday
.
However, if your dates do cross over, that's a good indication that the full date (12/3/2016
) should have been notated in the original cells, in which case
results <- df %>%
# ... %>%
mutate(
# ...
date = as.Date(str_extract(date, "(\\d{1,2}/){2,2}\\d{4,4}$"), format = "%m/%d/%Y")
# ...
) # ... %>%
would have sufficed to properly parse the date
s.
Upvotes: 3