Reputation: 64

Separating values into existing column in R

I'm tidying some data that I read into R from a PDF using tabulizer. Unfortunately some cells haven't been read properly. In column 9 (Split 5 at 37.1km) rows 3 and 4 contain information that should have ended up in column 10 (Final Time).

How do I separate that column (9) just for these rows and paste the necessary data into an already existing column (10)?

I know how to use tidyr::separate function but can't figure out how (an if) to apply it here. Any help and guidance will be appreciated.

structure(list(Rank = c("23", "24", "25", "26"), `Race Number` = c("13", 
"11", "29", "30"), Name = c("FOSS Tobias S.", "McNULTY Brandon", 
"BENNETT George", "KUKRLE Michael"), `NOC Code` = c("NOR", "USA", 
"NZL", "CZE"), `Split 1 at 9.7km` = c("13:47.65(22)", "13:28.23(15)", 
"14:05.46(30)", "14:05.81(32)"), `Split 2 at 15.0km` = c("19:21.16(22)", 
"19:04.80(18)", "19:47.53(31)", "19:48.77(32)"), `Split 3 at 22.1km` = c("29:17.44(24)", 
"29:01.94(20)", "29:58.88(28)", "29:58.09(27)"), `Split 4 at 31.8km` = c("44:06.82(24)", 
"43:51.67(23)", "44:40.28(25)", "44:42.74(26)"), `Split 5 at 37.1km` = c("49:49.65(24)", 
"49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"
), `Final Time` = c("59:51.68 (23)", "59:57.73 (24)", "", ""), 
    `Time Behind` = c("+4:47.49", "+4:53.54", "+5:24.20", "+5:37.36"
    ), `Average Speed` = c("44.302", "44.228", "43.854", "43.696"
    )), class = "data.frame", row.names = c(NA, -4L))

Upvotes: 0

Answers (4)

iago

Reputation: 3266

Calling df to your dataframe:

library(tidyr)
library(dplyr)
df %>%  
  separate(`Split 5 at 37.1km`, into = c("Split 5 at 37.1km","aux"), sep = "\\)") %>% 
  mutate(`Final Time` = coalesce(if_else(`Final Time`!="",`Final Time`, NA_character_), paste0(aux, ")")), 
          aux = NULL, 
          `Split 5 at 37.1km` = paste0(`Split 5 at 37.1km`, ")"))

  Rank Race Number            Name NOC Code Split 1 at 9.7km Split 2 at 15.0km Split 3 at 22.1km Split 4 at 31.8km Split 5 at 37.1km      Final Time
1   23          13  FOSS Tobias S.      NOR     13:47.65(22)      19:21.16(22)      29:17.44(24)      44:06.82(24)      49:49.65(24)   59:51.68 (23)
2   24          11 McNULTY Brandon      USA     13:28.23(15)      19:04.80(18)      29:01.94(20)      43:51.67(23)      49:40.49(23)   59:57.73 (24)
3   25          29  BENNETT George      NZL     14:05.46(30)      19:47.53(31)      29:58.88(28)      44:40.28(25)      50:21.82(25) 1:00:28.39 (25)
4   26          30  KUKRLE Michael      CZE     14:05.81(32)      19:48.77(32)      29:58.09(27)      44:42.74(26)      50:30.02(26) 1:00:41.55 (26)
  Time Behind Average Speed
1    +4:47.49        44.302
2    +4:53.54        44.228
3    +5:24.20        43.854
4    +5:37.36        43.696

Upvotes: 2

Joel Kandiah

Reputation: 1525

I like to use regex and stringr. Whilst theres some suboptimal code here the key step is with str_extract(). Using this we can select the two substrings we want, that of the first time and that of the second time. If either time is missing then we will have a missing value. So we can then fill in the columns based on where missingness occurs.

The regex string is as follows^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\$\\d+\$)\\.?+((\\d+:)?\\d{2}:\\d{2}.\\d{2} \$\\d+\$)$. Here we have 4 capture groups, the first and third group capture the two whole times respectively. the second and fourth select the optional groups containing the hour (this ensures that times over an hour are completely captured. Additionally we check for an optional space.

My code is as follows:

library(tidyverse)

data <- structure(list(Rank = c("23", "24", "25", "26"), `Race Number` = c("13", 
                                                                   "11", "29", "30"), Name = c("FOSS Tobias S.", "McNULTY Brandon", 
                                                                                               "BENNETT George", "KUKRLE Michael"), `NOC Code` = c("NOR", "USA", 
                                                                                                                                                   "NZL", "CZE"), `Split 1 at 9.7km` = c("13:47.65(22)", "13:28.23(15)", 
                                                                                                                                                                                         "14:05.46(30)", "14:05.81(32)"), `Split 2 at 15.0km` = c("19:21.16(22)", 
                                                                                                                                                                                                                                                  "19:04.80(18)", "19:47.53(31)", "19:48.77(32)"), `Split 3 at 22.1km` = c("29:17.44(24)", 
                                                                                                                                                                                                                                                                                                                           "29:01.94(20)", "29:58.88(28)", "29:58.09(27)"), `Split 4 at 31.8km` = c("44:06.82(24)", 
                                                                                                                                                                                                                                                                                                                                                                                                    "43:51.67(23)", "44:40.28(25)", "44:42.74(26)"), `Split 5 at 37.1km` = c("49:49.65(24)", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                             "49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"
                                                                                                                                                                                                                                                                                                                                                                                                    ), `Final Time` = c("59:51.68 (23)", "59:57.73 (24)", "", ""), 
               `Time Behind` = c("+4:47.49", "+4:53.54", "+5:24.20", "+5:37.36"
               ), `Average Speed` = c("44.302", "44.228", "43.854", "43.696"
               )), class = "data.frame", row.names = c(NA, -4L))

# Take data and use a matching string to the regex pattern
 data |> 
  mutate(match = map(`Split 5 at 37.1km`, ~unlist(str_match(., "^((\\d+:)?\\d{2}:\\d{2}.\\d{2}\\(\\d+\\))((\\d+:)?\\d{2}:\\d{2}.\\d{2} ?\\(\\d+\\))$")))) |> 
# Grab the strings that match the whole first and second/final times
  mutate(match1 = map(match, ~.[[2]]), match2 = map(match, ~.[[4]]), .keep = "unused") |> 
# Check where the NAs are and put into the dataframe accordingly
  mutate(`Split 5 at 37.1km`= ifelse(is.na(match1), `Split 5 at 37.1km`, match1),
         `Final Time` = ifelse(is.na(match2), `Final Time`, match2), .keep = "unused")
#>   Rank Race Number            Name NOC Code Split 1 at 9.7km Split 2 at 15.0km
#> 1   23          13  FOSS Tobias S.      NOR     13:47.65(22)      19:21.16(22)
#> 2   24          11 McNULTY Brandon      USA     13:28.23(15)      19:04.80(18)
#> 3   25          29  BENNETT George      NZL     14:05.46(30)      19:47.53(31)
#> 4   26          30  KUKRLE Michael      CZE     14:05.81(32)      19:48.77(32)
#>   Split 3 at 22.1km Split 4 at 31.8km Split 5 at 37.1km      Final Time
#> 1      29:17.44(24)      44:06.82(24)      49:49.65(24)   59:51.68 (23)
#> 2      29:01.94(20)      43:51.67(23)      49:40.49(23)   59:57.73 (24)
#> 3      29:58.88(28)      44:40.28(25)      50:21.82(25) 1:00:28.39 (25)
#> 4      29:58.09(27)      44:42.74(26)      50:30.02(26) 1:00:41.55 (26)
#>   Time Behind Average Speed
#> 1    +4:47.49        44.302
#> 2    +4:53.54        44.228
#> 3    +5:24.20        43.854
#> 4    +5:37.36        43.696

^{Created on 2021-07-28 by the reprex package (v2.0.0)}

Note in the above I use the base pipe from R 4.1 onwards |> this can be replaced simply with the magrittr pipe %>% if you are on an earlier R version.

Upvotes: 1

Zoe

Reputation: 1000

My answer is not really fancy, but it does the job for any number in the final time column. It works as long as there are always numbers in brackets at the end.

# dummy df
df <- data.frame("split" = c("49:49.65(24)", "49:40.49(23)", "50:21.82(25)1:00:28.39 (25)", "50:30.02(26)1:00:41.55 (26)"),
                 "final" = c("59:51.68 (23)", "59:57.73 (24)", "", ""))

# combining & splitting strings
merge_strings <- paste0(df$split, df$final)      
split_strings <- strsplit(merge_strings, ")")
df$split <- paste0(unlist(lapply(split_strings, "[[", 1)),")")
df$final <- paste0(unlist(lapply(split_strings, "[[", 2)),")")

This gives:

         split           final
1 49:49.65(24)   59:51.68 (23)
2 49:40.49(23)   59:57.73 (24)
3 50:21.82(25) 1:00:28.39 (25)
4 50:30.02(26) 1:00:41.55 (26)

Upvotes: 3

Martin Gal

Reputation: 16998

You could use dplyr and stringr:

library(dplyr)
library(stringr)

data %>% 
  mutate(`Final Time` = ifelse(`Final Time` == "", str_remove(`Split 5 at 37.1km`, "\\d+:\\d+\\.\\d+\\(\\d+\\)"), `Final Time`),
         `Split 5 at 37.1km` = str_extract(`Split 5 at 37.1km`, "\\d+:\\d+\\.\\d+\\(\\d+\\)"))

which returns

  Rank Race Number            Name NOC Code Split 1 at 9.7km Split 2 at 15.0km Split 3 at 22.1km Split 4 at 31.8km
1   23          13  FOSS Tobias S.      NOR     13:47.65(22)      19:21.16(22)      29:17.44(24)      44:06.82(24)
2   24          11 McNULTY Brandon      USA     13:28.23(15)      19:04.80(18)      29:01.94(20)      43:51.67(23)
3   25          29  BENNETT George      NZL     14:05.46(30)      19:47.53(31)      29:58.88(28)      44:40.28(25)
4   26          30  KUKRLE Michael      CZE     14:05.81(32)      19:48.77(32)      29:58.09(27)      44:42.74(26)
  Split 5 at 37.1km      Final Time Time Behind Average Speed
1      49:49.65(24)   59:51.68 (23)    +4:47.49        44.302
2      49:40.49(23)   59:57.73 (24)    +4:53.54        44.228
3      50:21.82(25) 1:00:28.39 (25)    +5:24.20        43.854
4      50:30.02(26) 1:00:41.55 (26)    +5:37.36        43.696

Upvotes: 2

Separating values into existing column in R

Answers (4)

Related Questions