krish
krish

Reputation: 1438

Filter only specific strings from a column in R

I have a dataset with a column that has comma separated values. I need to parse through each value in this column and keep only specific values and delete the others.

The code and data that I have for this is:

myDf <- structure(list(GeogPreferences = structure(1:4, .Label = c("Central and East Europe, Europe, North America, West Europe, US", 
"Europe, North America, West Europe, US", "Global, North America", 
"Northeast, Southeast, West, US"), class = "factor")), .Names = "GeogPreferences", class = "data.frame", row.names = c(NA, 
-4L))

 regionInterest <- c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")
 k<-lapply(as.character(myDf$GeogPreferences),function(x) {
   z<-trimws(unlist(strsplit(x,split = ",")))
   z <- ifelse((z %in% regionInterest), z[z %in% regionInterest], z)
 })
 myDf$GeogPreferences<-unlist(k)

This is the error I am getting:

Error in `$<-.data.frame`(`*tmp*`, "GeogPreferences", value = c("Central and East Europe",
: replacement has 15 rows, data has 4

My dataset looks like this:

    GeogPreferences
1   Central and East Europe, Europe, North America, West Europe, US
2   Europe, North America, West Europe, US
3   Global, North America
4   Northeast, Southeast, West, US

If the column has any on the strings from regionInterest, I want to keep that string, else I want to delete it.

The output I am expecting is:

    GeogPreferences
1   North America, US
2   North America, US
3   North America
4   Northeast, Southeast, West, US

Can someone help me with what I am doing wrong? Thanks!

Upvotes: 1

Views: 1058

Answers (3)

David Arenburg
David Arenburg

Reputation: 92292

You should probably first split the data and only then run the subset. This will both improve efficiency (because strsplit it is vectorized) and the size of the vector in each split won't matter. Also, no need in trimws, it just makes your code inefficient. Instead, split on ", " while specifying fixed = TRUE. This will make strsplit work about X10 times faster as it won't use regex in order to split.

The following works using base R only

do.call(rbind, # you can use `rbind.data.frame` instead if you don't want a matrix
        lapply(strsplit(as.character(myDf$GeogPreferences), ", ", fixed = TRUE), 
               function(x) toString(x[x %in% regionInterest])))
#     [,1]                            
# [1,] "North America, US"             
# [2,] "North America, US"             
# [3,] "North America"                 
# [4,] "Northeast, Southeast, West, US"

Though the above solution (similarly to your own) is still a rowwise one. Instead, we could try to achieve the same effect by operating columnwise. And by "columnwise" I mean that if we will operate on a transposed split, the number of iterations will be of the size of the longest sentence in myDf$GeogPreferences (the number of the commas we split by) which should be significantly smaller than the number of rows in the data.

Her's an illustration using data.table::tstrsplit

tmp <- data.table::tstrsplit(myDf$GeogPreferences, ", ", fixed = TRUE)
res <- do.call(paste, 
               c(sep = ", ", 
                 lapply(tmp, function(x) replace(x, !x %in% regionInterest, NA_character_))))
gsub("NA, |, NA", "", res)
# [1] "North America, US" "North America, US" "North America" "Northeast, Southeast, West, US"

Here is simple benchmark on 100K rows data set

bigDF <- myDf[sample(nrow(myDf), 1e5, replace = TRUE),, drop = FALSE]

library(dplyr)
library(tidyr)
library(data.table)

tidyverse <- function(x) {
  x %>% 
    mutate(id = row_number()) %>% 
    separate_rows(GeogPreferences, sep = ",") %>% 
    mutate(GeogPreferences = trimws(GeogPreferences)) %>% 
    filter(GeogPreferences %in% c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")) %>% 
    group_by(id) %>% 
    summarize(GeogPreferences = toString(trimws(GeogPreferences))) %>% 
    select(-id) 

}

MF <- function(x) {
    k <- lapply(as.character(x$GeogPreferences), function(x) {
      z <- trimws(unlist(strsplit(x, split = ",")))
      z <- z[z %in% regionInterest] 
    })
    sapply(k, paste, collapse = ", ")
}


DA1 <- function(x) {
  do.call(rbind,
          lapply(strsplit(as.character(x$GeogPreferences), ", ", fixed = TRUE), 
                 function(x) toString(x[x %in% regionInterest])))
}

DA2 <- function(x) {
  tmp <- data.table::tstrsplit(x$GeogPreferences, ", ", fixed = TRUE)
  res <- do.call(paste, 
                 c(sep = ", ", 
                   lapply(tmp, function(x) replace(x, !x %in% regionInterest, NA_character_))))
  gsub("NA, |, NA", "", res)
}

system.time(tidyverse(bigDF))
# user  system elapsed 
# 17.67    0.01   17.91 
system.time(MF(bigDF))
# user  system elapsed 
# 15.52    0.00   15.70 
system.time(DA1(bigDF))
# user  system elapsed 
# 0.97    0.00    1.00 
system.time(DA2(bigDF))
# user  system elapsed 
# 0.25    0.00    0.25 

So the other two solutions ran over 15 secs compared to my both solutions that ran less than a second

Upvotes: 3

Marc Flury
Marc Flury

Reputation: 341

If you prefer a solution closer to your approach change it to

regionInterest <- c("Americas", "North America", "US",
  "Northeast","Southeast","West","Midwest","Southwest")
k<-lapply(as.character(myDf$GeogPreferences),function(x) {
  z<-trimws(unlist(strsplit(x,split = ",")))
# this makes sure you only use z which are in regionInterest
  z <- z[z %in% regionInterest] 
})
# paste with collapse creates one value out of a vector of string seperated by the collapse argument
myDf$GeogPreferences<-sapply(k, paste, collapse = ", ")

I hope this helps

Upvotes: 2

Jake Kaupp
Jake Kaupp

Reputation: 8072

The error that you're getting is due to strsplit creating more rows than your input df. Also in your ifelse statement, you're returning z on FALSE so it's not doing what you intended.

Here's a tidyr/dplyr solution to your problem.

myDf %>% 
  mutate(id = row_number()) %>% 
  separate_rows(GeogPreferences, sep = ",") %>% 
  mutate(GeogPreferences = trimws(GeogPreferences)) %>% 
  filter(GeogPreferences %in% c("Americas", "North America", "US", "Northeast","Southeast","West","Midwest","Southwest")) %>% 
  group_by(id) %>% 
  summarize(GeogPreferences = toString(trimws(GeogPreferences))) %>% 
  select(-id) 


# A tibble: 4 × 1
                 GeogPreferences
                           <chr>
1              North America, US
2              North America, US
3                  North America
4 Northeast, Southeast, West, US

Upvotes: 3

Related Questions