Reputation: 15
I have a dataframe that consists of multiple rows, and I would like to split every row into two components based off of elements of a vector (essentially run strsplit with a vector as the 'pattern') in R.
The dataframe (only one column) looks something like this:
[,1]
[1,] "apple please fuji"
[2,] "pear help name"
[3,] "banana me mango"
Whereas my pattern vector could look like this: v <- c("please", "help", "me").
If possible, I would like my end output to be:
df$name df$part1 df$split df$part2
"apple please fuji" "apple" "please" "fuji"
"pear help name" "pear" "help" "name"
"banana me mango" "banana" "me" "mango"
I would appreciate any help with the in-between step of isolating components based on a vector, but if there is an even easier way to put the result into a dataframe, that would be great! Thank you so much!
Upvotes: 0
Views: 1849
Reputation: 38500
Here are two methods in base R.
Start with a character vector:
text <- c("apple please fuji", "pear help name", "banana me mango")
Also, the desired variable names (for convenience)
varNames <- c("name", "part1", "split", "part2")
using regexec and regmatches
You can use regular expressions with the regmatches / regexec combination to construct this dataset.
First, build a regular expression from v with paste0 and paste.
myRegex <- paste0("^(.*) +(", paste(v, collapse="|"), ") +(.*)$")
myRegex
[1] "^(.*) +(please|help|me) +(.*)$"
setNames(do.call(rbind.data.frame, regmatches(text, regexec(myRegex, text))), varNames)
this returns
name part1 split part2
1 apple please fuji apple please fuji
2 pear help name pear help name
3 banana me mango banana me mango
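To see why four columns come out of this approach, it may help to look at a single string. The sketch below assumes the v from the question and the same pattern construction; the variable name m is only for illustration. regexec records the full match plus each capture group, and regmatches extracts them in that order.

```r
# Pattern vector from the question
v <- c("please", "help", "me")
# Three capture groups: text before, the delimiter word, text after
myRegex <- paste0("^(.*) +(", paste(v, collapse = "|"), ") +(.*)$")

# regexec() records the position of the full match and of each group;
# regmatches() extracts the corresponding substrings as one character vector.
m <- regmatches("pear help name", regexec(myRegex, "pear help name"))[[1]]
m
# [1] "pear help name" "pear"           "help"           "name"
```

The first element is the whole match, which is why it can serve directly as the name column.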
using strsplit and do.call
First, split each element of text by the corresponding element of v (strsplit recycles split along its first argument)
tmp <- do.call(strsplit, list(text, split=v))
tmp
[[1]]
[1] "apple " " fuji"
[[2]]
[1] "pear " " name"
[[3]]
[1] "banana " " mango"
Now, rbind.data.frame these, which returns a data.frame, cbind the text and v variables, reorder the columns, and then add names with setNames. Note that part1 and part2 keep the spaces that surrounded the delimiter.
setNames(cbind(text, do.call(rbind.data.frame, tmp), v)[c(1, 2, 4, 3)], varNames)
this returns
name part1 split part2
1 apple please fuji apple please fuji
2 pear help name pear help name
3 banana me mango banana me mango
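The pieces returned by strsplit keep the spaces that surrounded the delimiter ("apple " and " fuji"). If that matters, one possible variant is to trim them with trimws before building the data frame. This is only a sketch; the names parts and res are illustrative, not part of the original answer.

```r
text <- c("apple please fuji", "pear help name", "banana me mango")
v <- c("please", "help", "me")

# strsplit() recycles `split` along `text`, so row i is split on v[i]
tmp <- strsplit(text, split = v)
# Strip leading and trailing whitespace from every piece
parts <- lapply(tmp, trimws)

# Assemble the columns explicitly, in the desired order
res <- data.frame(
  name  = text,
  part1 = vapply(parts, `[`, character(1), 1),
  split = v,
  part2 = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
res
```

Building the columns by name avoids having to reorder them by position afterwards.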
Upvotes: 2
Reputation: 4534
This solution assumes the number of elements in v is equal to the number of rows in the dataframe. You can use separate from the tidyr package to create part1 and part2.
library(tidyverse)
df <- tibble(name = c("apple please fuji", "pear help name", "banana me mango"))
v <- c("please", "help", "me")
df %>%
separate(name, c("part1", "part2"), v, remove = FALSE) %>%
add_column(split = v, .before = "part2")
#> # A tibble: 3 x 4
#> name part1 split part2
#> <chr> <chr> <chr> <chr>
#> 1 apple please fuji apple please fuji
#> 2 pear help name pear help name
#> 3 banana me mango banana me mango
If you want to try and split each row using any element in v, then you could try pasting v into a single pattern first before separating. I think something like this should work.
library(tidyverse)
library(stringr)
p <- paste0("\\b(?:", paste(v, collapse = "|"), ")\\b")
df %>%
separate(name, c("part1", "part2"), p, remove = FALSE) %>%
mutate(split = str_extract(name, p)) %>%
select(name, part1, split, part2)
#> # A tibble: 3 x 4
#> name part1 split part2
#> <chr> <chr> <chr> <chr>
#> 1 apple please fuji apple please fuji
#> 2 pear help name pear help name
#> 3 banana me mango banana me mango
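The \\b word boundaries in p matter here because "me" also occurs inside "name". A small base R check, assuming the same v and p as above (plain_hit, bounded_hit and bounded are illustrative names), shows the difference:

```r
v <- c("please", "help", "me")
p <- paste0("\\b(?:", paste(v, collapse = "|"), ")\\b")
x <- "pear help name"

# Without boundaries, "me" is found inside "name"
plain_hit <- grepl("me", "name")
# With boundaries, "name" on its own does not match the pattern
bounded_hit <- grepl(p, "name", perl = TRUE)
# In the full string, the first standalone delimiter word is extracted
bounded <- regmatches(x, regexpr(p, x, perl = TRUE))

plain_hit    # TRUE
bounded_hit  # FALSE
bounded      # "help"
```

perl = TRUE is used so that the non-capturing group (?:...) is handled by the PCRE engine.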
Upvotes: 1
Reputation: 329
# Create the df
name <- c("apple please fuji", "pear help name", "banana me mango")
# as.data.frame
df <- as.data.frame(name, stringsAsFactors = FALSE)
# Initialize an empty character matrix: one row per row of df, three columns.
df_n <- matrix(NA_character_, nrow = nrow(df), ncol = 3)
# Loop through the rows of df and store each space-separated word.
for (i in seq_len(nrow(df))) {
  words <- strsplit(df$name[i], " ")[[1]]
  for (j in seq_len(ncol(df_n))) {
    df_n[i, j] <- words[j]
  }
}
# Convert to a data frame (columns V1, V2, V3) and add the pieces to the original df.
df_n <- as.data.frame(df_n, stringsAsFactors = FALSE)
df$part1 <- df_n$V1
df$part2 <- df_n$V2
df$part3 <- df_n$V3
print(df)
Upvotes: 0