Reputation: 747
I receive data that isn't in a very nice format (and I can't change them upstream). There is one column that needs to be reordered and split into 10+ other columns based on certain keywords.
Here's an example of data I receive - for each person, they have chosen one 3 different foods. Their choices for each food category (food1
, food2
, food3
) come right after the text:
list1 <- c(' food1 pasta food2 apple food3 carrot ')
list2 <- c(' food2 banana food3 cucumber food1 brown rice ')
list3 <- c(' food3 bell pepper food2 plum food1 bread ')
foodListDF <- as.data.frame(matrix(c(1,2,3, list1, list2, list3), nrow = 3), stringsAsFactors = FALSE)
colnames(foodListDF) <- c('Person', 'Choices')
foodListDF
Person Choices
1 1 food1 pasta food2 apple food3 carrot
2 2 food2 banana food3 cucumber food1 brown rice
3 3 food3 bell pepper food2 plum food1 bread
The above is the format I receive my data in. My end goal is to split the Choices
column into 3 separate columns labeled food1, food2, and food3 which requires things to be ordered properly:
Person food1 food2 food3
1 1 pasta apple carrot
2 2 brown rice banana cucumber
3 3 bread plum bell pepper
I know that I can split the choices doing something like this:
library(stringr)
as.data.frame(str_split_fixed(foodListDF$Choices, c(' food1 | food2 | food3 '), 4))[,2:4]
V2 V3 V4
1 pasta apple carrot
2 banana cucumber brown rice
3 bell pepper plum bread
But this obviously doesn't split them into their proper groups/order which is very necessary.
I'm really just struggling to think how to extract the correct food from the proper group for each person. Any ideas?
Upvotes: 3
Views: 628
Reputation: 39657
You can use food
as a delimiter in strsplit
, sort
the result, remove the first character with substring
and return the result to your dataset.
foodListDF[paste0("food",1:3)] <- t(sapply(strsplit(foodListDF$Choices, "food"),
function (x) trimws(substring(sort(x[-1]), 2))))
foodListDF[-2]
# Person food1 food2 food3
#1 1 pasta apple carrot
#2 2 brown rice banana cucumber
#3 3 bread plum bell pepper
Or in case there are not all the time all levels present:
j <- sort(unique(unlist(regmatches(foodListDF$Choices, gregexpr("food\\d+",
foodListDF$Choices)))))
k <- sub("food", "", j)
foodListDF[j] <- t(sapply(strsplit(foodListDF$Choices, "food"), function(x)
trimws(sub("^\\d+", "", x[charmatch(k, x)]))))
foodListDF[-2]
# Person food1 food2 food3
#1 1 pasta apple carrot
#2 2 brown rice banana cucumber
#3 3 bread plum bell pepper
Upvotes: 0
Reputation: 5788
In one convoluted Base R expression:
data.frame(cbind(Person = foodListDF$Person,
do.call("rbind", Map(function(x){y <- setNames(x[[2]], x[[1]]); y[order(x[[1]])]},
lapply(strsplit(foodListDF$Choices, "\\s+"), function(x) {
res <- data.frame(t(grep("food\\d+", x, value = TRUE)), stringsAsFactors = FALSE)
res2 <- unlist(strsplit(gsub("^&&\\s*", "",
paste0(Filter(function(y){y != ""}, Vectorize(gsub)("food\\d+", "&&", x)),
collapse = " ")), "\\s*&&\\s*"))
list(res, res2)
}
)
)
)
), stringsAsFactors = FALSE)
Upvotes: 0
Reputation: 193517
Here are two base R approaches, both involving regmatches
and gregexpr
.
The first makes use of unstack
. It results in a data.frame
.
splitfun1 <- function(string) {
mat <- gregexpr("food\\d+ ", string)
unstack(
list(l1 = unlist(lapply(regmatches(string, mat), trimws), use.names = FALSE),
l2 = unlist(lapply(regmatches(string, mat, invert = TRUE),
function(x) trimws(x[-1])), use.names = FALSE)),
l2 ~ l1)
}
splitfun1(foodListDF$Choices)
# food1 food2 food3
# 1 pasta apple carrot
# 2 brown rice banana cucumber
# 3 bread plum bell pepper
The second makes use of matrix indexing to fill in an empty matrix. It's probably a bit more efficient than the first alternative. It results in a matrix.
splitfun2 <- function(string) {
mat <- gregexpr("food\\d+ ", string)
l1 <- lapply(regmatches(string, mat), trimws)
l2 <- lapply(regmatches(string, mat, invert = TRUE),
function(x) trimws(x[-1]))
ul <- unlist(l1, use.names = FALSE)
cn <- sort(unique(ul))
out <- matrix(NA_character_, nrow = length(string), ncol = length(cn),
dimnames = list(seq_along(string), cn))
out[cbind(rep(seq_along(string), lengths(l1)), ul)] <- unlist(l2, use.names = FALSE)
out
}
splitfun2(foodListDF$Choices)
# food1 food2 food3
# 1 "pasta" "apple" "carrot"
# 2 "brown rice" "banana" "cucumber"
# 3 "bread" "plum" "bell pepper"
Of course, with either of these, you would then need to cbind
the result with the relevant columns from the source data.frame
.
cbind(foodListDF[1], splitfun2(foodListDF$Choices))
Another option is to use cSplit
from my "splitstackshape" package along with some pretty straightforward gsub
work, followed by dcast
to go into a wide form.
library(splitstackshape)
# library(data.table) # if required
# Basic helper function
fun <- function(string) {
list(gsub("(food\\d+) (.*)", "\\1", string),
gsub("(food\\d+) (.*)", "\\2", string))
}
cSplit(as.data.table(foodListDF)[, Choices := gsub(" food", ",food", trimws(Choices))],
"Choices", ",", "long")[, fun(Choices), Person][, dcast(.SD, Person ~ V1, value.var = "V2")]
# Person food1 food2 food3
# 1: 1 pasta apple carrot
# 2: 2 brown rice banana cucumber
# 3: 3 bread plum bell pepper
Adapting the above to "dplyr" + "tidyr", you can try:
library(dplyr)
library(tidyr)
foodListDF %>%
mutate(Choices = gsub(" food", ",food", trimws(Choices))) %>%
separate_rows(Choices, sep = ",") %>%
separate(Choices, c("var", "val"), extra = "merge") %>%
pivot_wider(names_from = var, values_from = val)
# # A tibble: 3 x 4
# Person food1 food2 food3
# <chr> <chr> <chr> <chr>
# 1 1 pasta apple carrot
# 2 2 brown rice banana cucumber
# 3 3 bread plum bell pepper
Upvotes: 1
Reputation: 388972
You could extract the food number and food items separately (t1
and t2
), join them together, unnest
the data and get it into wide format.
library(dplyr)
library(tidyr)
foodListDF %>%
mutate(food = stringr::str_extract_all(Choices, 'food\\d+')) %>%
select(-Choices) -> t1
foodListDF %>%
separate_rows(Choices, sep = 'food\\d+') %>%
filter(Choices != ' ') %>%
mutate(Choices = trimws(Choices)) %>%
group_by(Person) %>%
summarise(col = list(Choices)) -> t2
inner_join(t1, t2, by = 'Person') %>%
unnest(c(food, col)) %>%
pivot_wider(names_from = food, values_from = col)
# Person food1 food2 food3
# <chr> <chr> <chr> <chr>
#1 1 pasta apple carrot
#2 2 brown rice banana cucumber
#3 3 bread plum bell pepper
Upvotes: 1