user5433097
user5433097

Reputation:

parse column in r (or other language such as SQL)

Here's the current dataframe:

baking_time <- c("20 to 30 min", "20 to 30 min", "40 to 50 min", "10 to 30 min", "60 to 90 min", "40 to 50 min")
cake_type <- c("Chocolate", "Chocolate","Lemon","Tart","German","Lemon")


recipes <- data.frame(baking_time, cake_type)

Now I'm trying to parse baking_time to get this:

baking_time <- c(25, 25, 45, 20, 75, 45)

I've tried using parse but I'm having trouble parsing both numbers than performing operations on them

mutate(avg_time = (parse_number(baking_time) + parse_number(baking_time))/2)

Upvotes: 8

Views: 223

Answers (3)

hrbrmstr
hrbrmstr

Reputation: 78792

stringi (the un-crutched stringr) and base R:

stringi::stri_match_first_regex(
  recipes$baking_time, 
  "([[:digit:]]+)[[:space:]]+to[[:space:]]+([[:digit:]]+)",
)[,2:3] -> x
class(x) <- "numeric"
apply(x, 1, mean)
## [1] 25 25 45 20 75 45

Upvotes: 2

akrun
akrun

Reputation: 887088

We extract the numeric part of the column and get the mean

library(tidyverse)
recipes %>% 
   mutate(avg_time = str_extract_all(baking_time, "\\d+") %>%
           map(., ~ mean(as.numeric(.x))))
#   baking_time cake_type avg_time
#1 20 to 30 min Chocolate       25
#2 20 to 30 min Chocolate       25
#3 40 to 50 min     Lemon       45
#4 10 to 30 min      Tart       20
#5 60 to 90 min    German       75
#6 40 to 50 min     Lemon       45

NOTE: readr::parse_number extracts the first numeric part. If there are multiple elements, needs to break it down and apply parse_number

recipes %>% 
   separate(baking_time, into = c("first", "second"),
            sep=" to ", remove = FALSE) %>% 
   transmute(baking_time, avg_time = (parse_number(first) + parse_number(second))/2)

With base R, one option is to read with read.csv after changing the non-numeric part to a delimiter with gsub, get the rowMeans

rowMeans(read.csv(text=gsub("\\D+", ",", recipes$baking_time), header = FALSE)[-3])
#[1] 25 25 45 20 75 45

Upvotes: 7

G5W
G5W

Reputation: 37641

You can do this in base R using gregexpr and regmatches to get the times.

Times = regmatches(baking_time, gregexpr("\\d+", baking_time))
sapply(Times, function(x) mean(as.numeric(x)))
[1] 25 25 45 20 75 45

Upvotes: 4

Related Questions