sydg
sydg

Reputation: 82

Splitting unspecific length string into columns

I have a column of strings, and I am trying to split them into columns. I've read a couple of stuff online (at the bottom), but I think this is different.

stringColumn <- c(
  "50m 26.50 26.50 100m 53.82 27.32",                                     
  NA,                                                                      
  "50m 25.83 25.83 100m 52.99 27.16",                                      
  "25m 12.46 12.46 50m 26.26 13.80 75m 40.02 13.76 100m 53.48 13.46",      
  NA,                                                                      
  "25m 11.72 11.72 50m 25.33 13.61 75m 39.15 13.82 100m 52.55 13.40",      
  "50m 27.20 27.20 100m 56.38 29.18 150m 1:26.39 30.01 200m 1:56.16 29.77")

This goes on for a while (970K rows), and the strings follow the same format, either NA or (measurement totaltime timeatcurrentpoint, repeating). Because it is not equal, I can't use regex or str_split. My goal is to get a dataframe with the column names of measurements (25m, 50m, 100m, and so on) with the totaltime at that point.

Desired output:

25m     50m     75m     100m    150m  200m    
NA      26.5    NA      53.82   NA    NA       
NA      NA      NA      NA      NA    NA       
12.46   26.26   40.02   53.48   NA    NA       
NA      NA      NA      NA      NA    NA       
11.72   25.33   39.15   52.55   NA    NA       
NA      27.2    NA      56.38   NA    1:56.16 

Upvotes: 0

Views: 99

Answers (2)

Calum You
Calum You

Reputation: 15072

Here's one way using tidyverse tools. We can:

  1. Use str_split with a (?=\\d+?m) lookahead to only split each string at the markers between distances (i.e. spaces followed by some digits and m);
  2. Put into a tibble, add a rowid column so we know which measurements came from which strings;
  3. unnest so each measurement is on its own row;
  4. separate so each of measurement, totaltime and current time has its own column, and discard current time
  5. spread to make the measurements the column headers.

Note that this doesn't quite match your sample output, ignoring the column order, but I think it correctly corresponds to the input.

library(tidyverse)

stringColumn = c(
  "50m 26.50 26.50 100m 53.82 27.32",
  NA,
  "50m 25.83 25.83 100m 52.99 27.16",
  "25m 12.46 12.46 50m 26.26 13.80 75m 40.02 13.76 100m 53.48 13.46",
  NA,
  "25m 11.72 11.72 50m 25.33 13.61 75m 39.15 13.82 100m 52.55 13.40",
  "50m 27.20 27.20 100m 56.38 29.18 150m 1:26.39 30.01 200m 1:56.16 29.77"
)

stringColumn %>%
  str_split(" (?=\\d+?m)") %>%
  tibble(strings = .) %>%
  rowid_to_column(var = "stringid") %>%
  unnest(strings) %>%
  separate(strings, c("measurement", "totaltime", "timeatcurrentpoint"), sep = " ") %>%
  select(-timeatcurrentpoint) %>%
  spread(measurement, totaltime)
#> # A tibble: 7 x 8
#>   stringid `100m` `150m`  `200m`  `25m` `50m` `75m` `<NA>`
#>      <int> <chr>  <chr>   <chr>   <chr> <chr> <chr> <chr> 
#> 1        1 53.82  <NA>    <NA>    <NA>  26.50 <NA>  <NA>  
#> 2        2 <NA>   <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  
#> 3        3 52.99  <NA>    <NA>    <NA>  25.83 <NA>  <NA>  
#> 4        4 53.48  <NA>    <NA>    12.46 26.26 40.02 <NA>  
#> 5        5 <NA>   <NA>    <NA>    <NA>  <NA>  <NA>  <NA>  
#> 6        6 52.55  <NA>    <NA>    11.72 25.33 39.15 <NA>  
#> 7        7 56.38  1:26.39 1:56.16 <NA>  27.20 <NA>  <NA>

Upvotes: 4

moodymudskipper
moodymudskipper

Reputation: 47340

We can change your data into an expression and evaluate it :

stringColumn2 <-  gsub("\\s*(\\d+m)\\s*","'),`\\1`=c('",stringColumn[!is.na(stringColumn)])
stringColumn3 <-  paste0(gsub("^'),","list(",stringColumn2),"'))")
stringColumn4 <-  gsub("\\s+","','",stringColumn3)
stringColumn5 <- paste0("list(",paste(stringColumn4,collapse=","),")")

library(dplyr)
bind_rows(eval(parse(text=stringColumn5)),.id = "id")
# # A tibble: 10 x 7
#    id    `50m` `100m` `25m` `75m` `150m`  `200m` 
#    <chr> <chr> <chr>  <chr> <chr> <chr>   <chr>  
#  1 1     26.50 53.82  <NA>  <NA>  <NA>    <NA>   
#  2 1     26.50 27.32  <NA>  <NA>  <NA>    <NA>   
#  3 2     25.83 52.99  <NA>  <NA>  <NA>    <NA>   
#  4 2     25.83 27.16  <NA>  <NA>  <NA>    <NA>   
#  5 3     26.26 53.48  12.46 40.02 <NA>    <NA>   
#  6 3     13.80 13.46  12.46 13.76 <NA>    <NA>   
#  7 4     25.33 52.55  11.72 39.15 <NA>    <NA>   
#  8 4     13.61 13.40  11.72 13.82 <NA>    <NA>   
#  9 5     27.20 56.38  <NA>  <NA>  1:26.39 1:56.16
# 10 5     27.20 29.18  <NA>  <NA>  30.01   29.77  

intermediate steps:

stringColumn2[1]
# [1] "'),`50m`=c('26.50 26.50'),`100m`=c('53.82 27.32"
stringColumn3[1]
# [1] "list(`50m`=c('26.50 26.50'),`100m`=c('53.82 27.32'))"
stringColumn4[1]
# [1] "list(`50m`=c('26.50','26.50'),`100m`=c('53.82','27.32'))"

Upvotes: 3

Related Questions