String split on a number word pattern

Question

I have a data frame that looks like this:

V1                        V2
peanut butter sandwich    2 slices of bread 1 tablespoon peanut butter

What I'm aiming to get is:

V1                        V2
peanut butter sandwich    2 slices of bread
peanut butter sandwich    1 tablespoon peanut butter

I've tried strsplit(df$V2, " "), but that splits on every space. Is there a way to split the string only at each number, so that each piece runs from one number up to the next?

Jota · Accepted Answer

You can split the string as follows:

txt <- "2 slices of bread 1 tablespoon peanut butter"

strsplit(txt, " (?=\\d)", perl=TRUE)[[1]]
#[1] "2 slices of bread"          "1 tablespoon peanut butter"

The regex matches a space that is followed by a digit. The zero-width positive lookahead (?=...) asserts that the next character is a digit (\d, written "\\d" inside an R string) without making that digit part of the match. Why the zero-width lookahead? Because strsplit removes whatever the pattern matches: we don't want the digit to be used as a splitting character, we just want to match any space that is followed by a digit.
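To make the role of the lookahead concrete, here is a small sketch comparing it with a pattern that matches the digit directly. The variable names consumed and kept are only illustrative.

```r
txt <- "2 slices of bread 1 tablespoon peanut butter"

# Without the lookahead, the digit is part of the match, so strsplit drops it
consumed <- strsplit(txt, " \\d", perl = TRUE)[[1]]
consumed
#[1] "2 slices of bread"         " tablespoon peanut butter"

# With the lookahead, only the space is matched, so the digit stays in the piece
kept <- strsplit(txt, " (?=\\d)", perl = TRUE)[[1]]
kept
#[1] "2 slices of bread"          "1 tablespoon peanut butter"
```

Note that the first version loses the "1" and leaves a leading space on the second piece.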

To use that idea and construct your data frame, see this example:

item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
txt <- c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato")
df <- data.frame(item, txt, stringsAsFactors=FALSE)

# thanks to Ananda for recommending setNames
split.strings <- setNames(strsplit(df$txt, " (?=\\d)", perl=TRUE), df$item)
# alternately: 
#split.strings <- strsplit(df$txt, " (?=\\d)", perl=TRUE)
#names(split.strings) <- df$item

stack(split.strings)
#                      values                    ind
#1          2 slices of bread peanut butter sandwich
#2 1 tablespoon peanut butter peanut butter sandwich
#3                    1 onion       onion carrot mix
#4                  3 carrots       onion carrot mix
#5                     potato            hash browns
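stack() returns the split pieces in a column called values and the names in a factor column called ind. If you want the V1/V2 layout from the question, one possible follow-up is to reorder and rename the columns. This is a sketch that rebuilds the example data so it runs on its own; the names stacked and result are just placeholders.

```r
item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
txt <- c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato")
df <- data.frame(item, txt, stringsAsFactors = FALSE)

# Split each string at spaces that precede a digit, naming pieces by item
split.strings <- setNames(strsplit(df$txt, " (?=\\d)", perl = TRUE), df$item)
stacked <- stack(split.strings)

# Put the item first, convert the factor back to character, and rename
result <- data.frame(V1 = as.character(stacked$ind),
                     V2 = stacked$values,
                     stringsAsFactors = FALSE)
result
```

Rows with no number in the string, such as "potato", pass through as a single unsplit row.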
