Reputation: 31
I am looking for help implementing a function in R to truncate a level_stream string column (ls in the example) of my data frame, and I haven't had much luck yet. Essentially, when pre_quiz_score is not NA for a row, I want to remove the beginning of the string up to and including the first | character, and when post_quiz_score is not NA for that row, I want to remove everything from the last | character onward.
df <- data.frame(ls = c('123 L0=38/42|425 L0=40/42', NA, '482 L7=7/12|789 L8=5/6|523 L9=2/6'),
                 pre_quiz_score = c(88, NA, 12),
                 post_quiz_score = c(NA, NA, 90))
I want to implement this in a vectorized, "tidyverse" way to get something like:
---------------------------------------------------
| ls           | pre_quiz_score | post_quiz_score |
| 425 L0=40/42 | 88             | NA              |
| NA           | NA             | NA              |
| 789 L8=5/6   | 12             | 90              |
So far, I haven't gotten stringr::str_split, gsub, or sub to work correctly, mostly because I end up removing just the | characters, or everything except the last | and what follows it.
I hope that makes sense, thanks!
Upvotes: 1
Views: 402
Reputation: 3947
tidyr::separate() allows you to split up a column into sub-columns. With the extra = "drop" argument it will keep only up to length(into) columns.
library(tidyr)
separate(df, ls, c("remove", "keep"), sep="\\|", extra = "drop")
#> remove keep pre_quiz_score post_quiz_score
#> 1 123 L0=38/42 425 L0=40/42 88 NA
#> 2 <NA> <NA> NA NA
#> 3 482 L7=7/12 789 L8=5/6 12 90
I've kept the part before the first | (the remove column), but you can drop that too if you don't need it.
Upvotes: 0
Reputation: 8413
library(dplyr)
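# as.character() guards against ls being a factor (the data.frame() default before R 4.0)
df %>% mutate(ls = sapply(strsplit(as.character(ls), "\\|"), function(x) x[2]))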
# ls pre_quiz_score post_quiz_score
#1 425 L0=40/42 88 NA
#2 <NA> NA NA
#3 789 L8=5/6 12 90
Upvotes: 2
Reputation: 886978
We can use sub from base R:
df$ls <- sub("^[^|]+\\|([^|]+).*", "\\1", df$ls)
df
# ls pre_quiz_score post_quiz_score
#1 425 L0=40/42 88 NA
#2 <NA> NA NA
#3 789 L8=5/6 12 90
We match one or more characters that are not a | ([^|]+) from the start (^) of the string, followed by a | (escaped as \\| because it is a metacharacter), then capture one or more characters that are not a | as a group (i.e. inside the parentheses, ([^|]+)), followed by any characters until the end of the string (.*), and replace the whole match with the backreference to the captured group (\\1; as there is only a single capture group and it is the first one, we denote it by 1).
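To make the pieces of that pattern concrete, here is a small sketch in base R using the sample strings from the question. The variable names (x, second_field, without_tail, no_pipe) are only illustrative and do not come from the answer.

```r
# Sample strings taken from the question's data frame
x <- c("123 L0=38/42|425 L0=40/42",
       NA,
       "482 L7=7/12|789 L8=5/6|523 L9=2/6")

# Full pattern from the answer: keep only the second |-delimited field
second_field <- sub("^[^|]+\\|([^|]+).*", "\\1", x)

# Without the trailing ".*", text after the second field is left in place,
# because sub() only replaces the portion of the string that matched
without_tail <- sub("^[^|]+\\|([^|]+)", "\\1", x)

# A string containing no "|" does not match, so sub() returns it unchanged
no_pipe <- sub("^[^|]+\\|([^|]+).*", "\\1", "123 L0=38/42")

print(second_field)
print(without_tail)
print(no_pipe)
```

Comparing second_field with without_tail shows that the trailing .* is what discards the third field in the last row.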
Upvotes: 4
Reputation: 78792
Just implement the logic as you stated it:
library(stringi)
library(dplyr)
df <- data.frame(ls = c('123 L0=38/42|425 L0=40/42', NA, '482 L7=7/12|789 L8=5/6|523 L9=2/6'),
pre_quiz_score = c(88, NA, 12),
post_quiz_score = c(NA, NA, 90),
stringsAsFactors=FALSE)
df %>%
  mutate(ls = ifelse(!is.na(pre_quiz_score),
                     stri_replace_first_regex(ls, "^[[:alnum:][:blank:]=/]+\\|", ""), ls),
         ls = ifelse(!is.na(post_quiz_score),
                     stri_replace_last_regex(ls, "\\|[[:alnum:][:blank:]=/]+$", ""), ls))
## ls pre_quiz_score post_quiz_score
## 1 425 L0=40/42 88 NA
## 2 <NA> NA NA
## 3 789 L8=5/6 12 90
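For comparison, the same conditional logic can be sketched with base R only. This is a hedged variant, not the method used above: the helper name truncate_ls is hypothetical, and the patterns assume fields never contain a | themselves.

```r
# Hypothetical helper (not from the answers): conditionally trim a
# "|"-delimited string vector based on two score vectors.
truncate_ls <- function(ls, pre, post) {
  ls <- as.character(ls)  # guard against factor columns
  # Where pre is present, drop everything up to and including the first "|"
  ls <- ifelse(!is.na(pre), sub("^[^|]*\\|", "", ls), ls)
  # Where post is present, drop the last "|" and everything after it
  ls <- ifelse(!is.na(post), sub("\\|[^|]*$", "", ls), ls)
  ls
}

df <- data.frame(ls = c('123 L0=38/42|425 L0=40/42', NA,
                        '482 L7=7/12|789 L8=5/6|523 L9=2/6'),
                 pre_quiz_score = c(88, NA, 12),
                 post_quiz_score = c(NA, NA, 90),
                 stringsAsFactors = FALSE)

df$ls <- truncate_ls(df$ls, df$pre_quiz_score, df$post_quiz_score)
print(df)
```

Because each trim is applied only where its score is present, a row with only a post_quiz_score keeps its leading fields: truncate_ls("a|b|c", NA, 1) gives "a|b". The helper should also work inside mutate(), e.g. mutate(ls = truncate_ls(ls, pre_quiz_score, post_quiz_score)).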
Upvotes: 3