Reputation: 1017
I have a data frame that looks like this:
V1 V2
peanut butter sandwich 2 slices of bread 1 tablespoon peanut butter
What I'm aiming to get is:
V1 V2
peanut butter sandwich 2 slices of bread
peanut butter sandwich 1 tablespoon peanut butter
I've tried to split the string using strsplit(df$v2, " ")
, but I can only split by the " "
. I'm not sure if you can split the string only at the first number and then take the characters until the next number.
Upvotes: 6
Views: 186
Reputation: 17611
You can split the string as follows:
txt <- "2 slices of bread 1 tablespoon peanut butter"
strsplit(txt, " (?=\\d)", perl=TRUE)[[1]]
#[1] "2 slices of bread" "1 tablespoon peanut butter"
The regex being used here is looking for spaces followed by a digit. It uses a zero-width positive lookahead (?=)
to say that if the space is followed by a digit (\\d
), then it's the type of space we want to split on. Why the zero-width lookahead? It's because we don't want to use the digit as a splitting character, we just want match any space that is followed by a digit.
To use that idea and construct your data frame, see this example:
item <- c("peanut butter sandwich", "onion carrot mix", "hash browns")
txt <- c("2 slices of bread 1 tablespoon peanut butter", "1 onion 3 carrots", "potato")
df <- data.frame(item, txt, stringsAsFactors=FALSE)
# thanks to Ananda for recommending setNames
split.strings <- setNames(strsplit(df$txt, " (?=\\d)", perl=TRUE), df$item)
# alternately:
#split.strings <- strsplit(df$txt, " (?=\\d)", perl=TRUE)
#names(split.strings) <- df$item
stack(split.strings)
# values ind
#1 2 slices of bread peanut butter sandwich
#2 1 tablespoon peanut butter peanut butter sandwich
#3 1 onion onion carrot mix
#4 3 carrots onion carrot mix
#5 potato hash browns
Upvotes: 7
Reputation: 193517
Let's imagine you're dealing with something like:
mydf <- data.frame(
V1 = c("peanut butter sandwich", "peanut butter and jam sandwich"),
V2 = c("2 slices of bread 1 tablespoon peanut butter",
"2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam"))
mydf
## V1
## 1 peanut butter sandwich
## 2 peanut butter and jam sandwich
## V2
## 1 2 slices of bread 1 tablespoon peanut butter
## 2 2 slices of bread 1 tablespoon peanut butter 1 tablespoon jam
You can first add in a delimiter that you don't expect in "V2", and use cSplit
from my "splitstackshape" to get the "long" dataset format.
library(splitstackshape)
mydf$V2 <- gsub(" (\\d+)", "|\\1", mydf$V2)
cSplit(mydf, "V2", "|", "long")
## V1 V2
## 1: peanut butter sandwich 2 slices of bread
## 2: peanut butter sandwich 1 tablespoon peanut butter
## 3: peanut butter and jam sandwich 2 slices of bread
## 4: peanut butter and jam sandwich 1 tablespoon peanut butter
## 5: peanut butter and jam sandwich 1 tablespoon jam
The following aren't really enough to post on their own as an answer, because they are variations on @Jota's approach, but I'm sharing them here for completeness:
strsplit
within "data.table"The split list
is automatically flattened into a single column....
library(data.table)
as.data.table(mydf)[, list(
V2 = unlist(strsplit(as.character(V2), '\\s(?=\\d)', perl=TRUE))), by = V1]
You can use unnest
from "tidyr" to expand the list column into a long form....
library(dplyr)
library(tidyr)
mydf %>%
mutate(V2 = strsplit(as.character(V2), " (?=\\d)", perl=TRUE)) %>%
unnest(V2)
Upvotes: 5