Reputation: 82
I have a column of strings, and I am trying to split them into columns. I've read a couple of stuff online (at the bottom), but I think this is different.
stringColumn <- c(
"50m 26.50 26.50 100m 53.82 27.32",
NA,
"50m 25.83 25.83 100m 52.99 27.16",
"25m 12.46 12.46 50m 26.26 13.80 75m 40.02 13.76 100m 53.48 13.46",
NA,
"25m 11.72 11.72 50m 25.33 13.61 75m 39.15 13.82 100m 52.55 13.40",
"50m 27.20 27.20 100m 56.38 29.18 150m 1:26.39 30.01 200m 1:56.16 29.77")
This goes on for a while (970K rows), and the strings follow the same format, either NA or (measurement totaltime timeatcurrentpoint, repeating). Because it is not equal, I can't use regex or str_split. My goal is to get a dataframe with the column names of measurements (25m, 50m, 100m, and so on) with the totaltime at that point.
Desired output:
25m 50m 75m 100m 150m 200m
NA 26.5 NA 53.82 NA NA
NA NA NA NA NA NA
12.46 26.26 40.02 53.48 NA NA
NA NA NA NA NA NA
11.72 25.33 39.15 52.55 NA NA
NA 27.2 NA 56.38 NA 1:56.16
Upvotes: 0
Views: 99
Reputation: 15072
Here's one way using tidyverse
tools. We can:
str_split
with a (?=\\d+?m)
lookahead to only split each string at the markers between distances (i.e. spaces followed by some digits and m
);tibble
, add a rowid column so we know which measurements came from which strings;unnest
so each measurement is on its own row;separate
so each of measurement, totaltime and current time has its own column, and discard current timespread
to make the measurements the column headers.Note that this doesn't quite match your sample output, ignoring the column order, but I think it correctly corresponds to the input.
library(tidyverse)
stringColumn = c(
"50m 26.50 26.50 100m 53.82 27.32",
NA,
"50m 25.83 25.83 100m 52.99 27.16",
"25m 12.46 12.46 50m 26.26 13.80 75m 40.02 13.76 100m 53.48 13.46",
NA,
"25m 11.72 11.72 50m 25.33 13.61 75m 39.15 13.82 100m 52.55 13.40",
"50m 27.20 27.20 100m 56.38 29.18 150m 1:26.39 30.01 200m 1:56.16 29.77"
)
stringColumn %>%
str_split(" (?=\\d+?m)") %>%
tibble(strings = .) %>%
rowid_to_column(var = "stringid") %>%
unnest(strings) %>%
separate(strings, c("measurement", "totaltime", "timeatcurrentpoint"), sep = " ") %>%
select(-timeatcurrentpoint) %>%
spread(measurement, totaltime)
#> # A tibble: 7 x 8
#> stringid `100m` `150m` `200m` `25m` `50m` `75m` `<NA>`
#> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 53.82 <NA> <NA> <NA> 26.50 <NA> <NA>
#> 2 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 3 52.99 <NA> <NA> <NA> 25.83 <NA> <NA>
#> 4 4 53.48 <NA> <NA> 12.46 26.26 40.02 <NA>
#> 5 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 6 52.55 <NA> <NA> 11.72 25.33 39.15 <NA>
#> 7 7 56.38 1:26.39 1:56.16 <NA> 27.20 <NA> <NA>
Upvotes: 4
Reputation: 47340
We can change your data into an expression and evaluate it :
stringColumn2 <- gsub("\\s*(\\d+m)\\s*","'),`\\1`=c('",stringColumn[!is.na(stringColumn)])
stringColumn3 <- paste0(gsub("^'),","list(",stringColumn2),"'))")
stringColumn4 <- gsub("\\s+","','",stringColumn3)
stringColumn5 <- paste0("list(",paste(stringColumn4,collapse=","),")")
library(dplyr)
bind_rows(eval(parse(text=stringColumn5)),.id = "id")
# # A tibble: 10 x 7
# id `50m` `100m` `25m` `75m` `150m` `200m`
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 26.50 53.82 <NA> <NA> <NA> <NA>
# 2 1 26.50 27.32 <NA> <NA> <NA> <NA>
# 3 2 25.83 52.99 <NA> <NA> <NA> <NA>
# 4 2 25.83 27.16 <NA> <NA> <NA> <NA>
# 5 3 26.26 53.48 12.46 40.02 <NA> <NA>
# 6 3 13.80 13.46 12.46 13.76 <NA> <NA>
# 7 4 25.33 52.55 11.72 39.15 <NA> <NA>
# 8 4 13.61 13.40 11.72 13.82 <NA> <NA>
# 9 5 27.20 56.38 <NA> <NA> 1:26.39 1:56.16
# 10 5 27.20 29.18 <NA> <NA> 30.01 29.77
intermediate steps:
stringColumn2[1]
# [1] "'),`50m`=c('26.50 26.50'),`100m`=c('53.82 27.32"
stringColumn3[1]
# [1] "list(`50m`=c('26.50 26.50'),`100m`=c('53.82 27.32'))"
stringColumn4[1]
# [1] "list(`50m`=c('26.50','26.50'),`100m`=c('53.82','27.32'))"
Upvotes: 3