Reputation: 1438
I have a dataset with a column that has comma-separated values. I need to parse each value in this column, keep only specific values, and delete the others.
The code and data that I have for this is:
myDf <- structure(list(GeogPreferences = structure(1:4, .Label = c("Central and East Europe, Europe, North America, West Europe, US",
"Europe, North America, West Europe, US", "Global, North America",
"Northeast, Southeast, West, US"), class = "factor")), .Names = "GeogPreferences", class = "data.frame", row.names = c(NA,
-4L))
regionInterest <- c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")
k<-lapply(as.character(myDf$GeogPreferences),function(x) {
z<-trimws(unlist(strsplit(x,split = ",")))
z <- ifelse((z %in% regionInterest), z[z %in% regionInterest], z)
})
myDf$GeogPreferences<-unlist(k)
This is the error I am getting:
Error in `$<-.data.frame`(`*tmp*`, "GeogPreferences", value = c("Central and East Europe",
: replacement has 15 rows, data has 4
My dataset looks like this:
GeogPreferences
1 Central and East Europe, Europe, North America, West Europe, US
2 Europe, North America, West Europe, US
3 Global, North America
4 Northeast, Southeast, West, US
If the column contains any of the strings from regionInterest
, I want to keep that string; otherwise I want to delete it.
The output I am expecting is:
GeogPreferences
1 North America, US
2 North America, US
3 North America
4 Northeast, Southeast, West, US
Can someone help me with what I am doing wrong? Thanks!
Upvotes: 1
Views: 1058
Reputation: 92292
You should probably first split the data and only then run the subset. This will both improve efficiency (because strsplit
is vectorized) and mean that the size of the vector in each split won't matter. Also, there is no need for trimws
; it just makes your code inefficient. Instead, split on ", "
while specifying fixed = TRUE
. This will make strsplit
work about 10 times faster, as it won't use regex in order to split.
The following works using base R only:
do.call(rbind, # you can use `rbind.data.frame` instead if you don't want a matrix
        lapply(strsplit(as.character(myDf$GeogPreferences), ", ", fixed = TRUE),
               function(x) toString(x[x %in% regionInterest])))
# [,1]
# [1,] "North America, US"
# [2,] "North America, US"
# [3,] "North America"
# [4,] "Northeast, Southeast, West, US"
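As a quick aside, the fixed = TRUE speedup claimed above is easy to check with a small timing sketch (illustrative only; the exact ratio depends on your data and machine):

```r
# Illustrative sketch: compare regex vs. fixed splitting on a larger vector
x <- rep("Central and East Europe, Europe, North America, West Europe, US", 1e5)

t_regex <- system.time(r1 <- strsplit(x, ", "))[["elapsed"]]
t_fixed <- system.time(r2 <- strsplit(x, ", ", fixed = TRUE))[["elapsed"]]

identical(r1, r2)  # TRUE -- same result, only the speed differs
c(regex = t_regex, fixed = t_fixed)
```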
The above solution (like your own) is still a rowwise one, though. Instead, we could achieve the same effect by operating columnwise. By "columnwise" I mean that if we operate on a transposed split, the number of iterations will be the size of the longest sentence in myDf$GeogPreferences
(the number of commas we split by), which should be significantly smaller than the number of rows in the data.
Here's an illustration using data.table::tstrsplit
tmp <- data.table::tstrsplit(myDf$GeogPreferences, ", ", fixed = TRUE)
res <- do.call(paste,
               c(sep = ", ",
                 lapply(tmp, function(x) replace(x, !x %in% regionInterest, NA_character_))))
gsub("NA, |, NA", "", res)
# [1] "North America, US" "North America, US" "North America" "Northeast, Southeast, West, US"
Here is a simple benchmark on a 100K-row data set:
bigDF <- myDf[sample(nrow(myDf), 1e5, replace = TRUE),, drop = FALSE]
library(dplyr)
library(tidyr)
library(data.table)
tidyverse <- function(x) {
x %>%
mutate(id = row_number()) %>%
separate_rows(GeogPreferences, sep = ",") %>%
mutate(GeogPreferences = trimws(GeogPreferences)) %>%
filter(GeogPreferences %in% c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")) %>%
group_by(id) %>%
summarize(GeogPreferences = toString(trimws(GeogPreferences))) %>%
select(-id)
}
MF <- function(x) {
k <- lapply(as.character(x$GeogPreferences), function(x) {
z <- trimws(unlist(strsplit(x, split = ",")))
z <- z[z %in% regionInterest]
})
sapply(k, paste, collapse = ", ")
}
DA1 <- function(x) {
do.call(rbind,
lapply(strsplit(as.character(x$GeogPreferences), ", ", fixed = TRUE),
function(x) toString(x[x %in% regionInterest])))
}
DA2 <- function(x) {
tmp <- data.table::tstrsplit(x$GeogPreferences, ", ", fixed = TRUE)
res <- do.call(paste,
c(sep = ", ",
lapply(tmp, function(x) replace(x, !x %in% regionInterest, NA_character_))))
gsub("NA, |, NA", "", res)
}
system.time(tidyverse(bigDF))
# user system elapsed
# 17.67 0.01 17.91
system.time(MF(bigDF))
# user system elapsed
# 15.52 0.00 15.70
system.time(DA1(bigDF))
# user system elapsed
# 0.97 0.00 1.00
system.time(DA2(bigDF))
# user system elapsed
# 0.25 0.00 0.25
So the other two solutions ran for over 15 seconds, compared to both of my solutions, which ran in under a second.
Upvotes: 3
Reputation: 341
If you prefer a solution closer to your approach, change it to:
regionInterest <- c("Americas", "North America", "US",
"Northeast","Southeast","West","Midwest","Southwest")
k<-lapply(as.character(myDf$GeogPreferences),function(x) {
z<-trimws(unlist(strsplit(x,split = ",")))
# this makes sure you only use z which are in regionInterest
z <- z[z %in% regionInterest]
})
# paste with collapse creates one value out of a vector of strings, separated by the collapse argument
myDf$GeogPreferences<-sapply(k, paste, collapse = ", ")
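For example, collapse joins a character vector into a single string:

```r
# collapse joins the vector elements into one string, with ", " between them
paste(c("North America", "US"), collapse = ", ")
# [1] "North America, US"
```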
I hope this helps
Upvotes: 2
Reputation: 8072
The error that you're getting is due to strsplit
creating more rows than your input data frame. Also, in your ifelse
statement you're returning z on FALSE
, so it's not doing what you intended.
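To see why, note that ifelse always returns a vector the same length as its condition, recycling the yes branch, so no elements are ever dropped (a small illustration with made-up values):

```r
regionInterest <- c("North America", "US")
z <- c("Global", "North America")

# ifelse returns one element per entry of the condition,
# so "Global" survives instead of being removed
ifelse(z %in% regionInterest, z[z %in% regionInterest], z)
# [1] "Global"        "North America"
```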
Here's a tidyr
/dplyr
solution to your problem:
library(dplyr)
library(tidyr)

myDf %>%
mutate(id = row_number()) %>%
separate_rows(GeogPreferences, sep = ",") %>%
mutate(GeogPreferences = trimws(GeogPreferences)) %>%
filter(GeogPreferences %in% c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")) %>%
group_by(id) %>%
summarize(GeogPreferences = toString(trimws(GeogPreferences))) %>%
select(-id)
# A tibble: 4 × 1
GeogPreferences
<chr>
1 North America, US
2 North America, US
3 North America
4 Northeast, Southeast, West, US
Upvotes: 3