yokota
yokota

Reputation: 1017

String split on a number word pattern

I have a data frame that looks like this:

V1                        V2
peanut butter sandwich    2 slices of bread 1 tablespoon peanut butter

What I'm aiming to get is:

V1                        V2
peanut butter sandwich    2 slices of bread
peanut butter sandwich    1 tablespoon peanut butter

I've tried to split the string using strsplit(df$v2, " "), but I can only split by the " ". I'm not sure if you can split the string only at the first number and then take the characters until the next number.

Upvotes: 6

Views: 186

Answers (2)

Jota
Jota

Reputation: 17611

You can split the string as follows:

txt <- "2 slices of bread 1 tablespoon peanut butter"

strsplit(txt, " (?=\\d)", perl=TRUE)[[1]]
#[1] "2 slices of bread"          "1 tablespoon peanut butter"

The regex being used here is looking for spaces followed by a digit. It uses a zero-width positive lookahead (?=) to say that if the space is followed by a digit (\\d), then it's the type of space we want to split on. Why the zero-width lookahead? It's because we don't want to use the digit as a splitting character, we just want match any space that is followed by a digit.

To use that idea and construct your data frame, see this example:

item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
txt <- c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato")
df <- data.frame(item, txt, stringsAsFactors=FALSE)

# thanks to Ananda for recommending setNames
split.strings <- setNames(strsplit(df$txt, " (?=\\d)", perl=TRUE), df$item) 
# alternately: 
#split.strings <- strsplit(df$txt, " (?=\\d)", perl=TRUE)
#names(split.strings) <- df$item

stack(split.strings)
#                      values                    ind
#1          2 slices of bread peanut butter sandwich
#2 1 tablespoon peanut butter peanut butter sandwich
#3                    1 onion       onion carrot mix
#4                  3 carrots       onion carrot mix
#5                     potato            hash browns

Upvotes: 7

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Let's imagine you're dealing with something like:

mydf <- data.frame(
  V1 = c("peanut butter sandwich", "peanut butter and jam sandwich"), 
  V2 = c("2 slices of bread 1 tablespoon peanut butter", 
         "2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam"))  

mydf
##                               V1
## 1         peanut butter sandwich
## 2 peanut butter and jam sandwich
##                                                              V2
## 1                  2 slices of bread 1 tablespoon peanut butter
## 2 2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam

You can first add in a delimiter that you don't expect in "V2", and use cSplit from my "splitstackshape" to get the "long" dataset format.

library(splitstackshape)
mydf$V2 <- gsub(" (\\d+)", "|\\1", mydf$V2)
cSplit(mydf, "V2", "|", "long")
##                                V1                         V2
## 1:         peanut butter sandwich          2 slices of bread
## 2:         peanut butter sandwich 1 tablespoon peanut butter
## 3: peanut butter and jam sandwich          2 slices of bread
## 4: peanut butter and jam sandwich 1 tablespoon peanut butter
## 5: peanut butter and jam sandwich           1 tablespoon jam

The following aren't really enough to post on their own as an answer, because they are variations on @Jota's approach, but I'm sharing them here for completeness:

strsplit within "data.table"

The split list is automatically flattened into a single column....

library(data.table)
as.data.table(mydf)[, list(
  V2 = unlist(strsplit(as.character(V2), '\\s(?=\\d)', perl=TRUE))), by = V1]

"dplyr" + "tidyr"

You can use unnest from "tidyr" to expand the list column into a long form....

library(dplyr)
library(tidyr)
mydf %>% 
  mutate(V2 = strsplit(as.character(V2), " (?=\\d)", perl=TRUE)) %>% 
  unnest(V2)

Upvotes: 5

Related Questions