Reputation: 2227
I have a list matrix, where one of the "columns" is a list (I realize it's an odd dataset to work with, but I find it useful for other operations). Each entry of the list is either: (1) empty (integer(0)), (2) an integer, or (3) a vector of integers.
E.g. the R object "d.f", with d.f$ID an index vector and d.f$Basket_List the list:
ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)
d.f$Basket_List <- Basket_List
I'd like to create a new dataset that's a subset of the initial one, based on whether or not "Basket_List" contains certain value(s). E.g. a subset of all the rows in d.f such that Basket_List has "123", or "123" & "987" -- or other more complicated conditions.
I've tried every variation of the following, but to no avail.
d.f2 <- subset(d.f, 123 %in% Basket_List)
d.f2 <- subset(d.f, 123 == any(Basket_List))
d.f2 <- d.f[which(123 %in% d.f$Basket_List),]
# should return the subset, with rows 2,3,5,7 & 8
My other issue is that I'll be running this operation over many millions of rows (it's transaction data), so I'd like to optimize it as much as possible for speed (I have a complicated for loop now, but it takes too much time).
If you think it might be useful, the data could also be set up as follows:
ID <- c(1,2,2,3,3,4,5,5,6,7,7,8,8,9)
Basket <- c(NA,123,987,123,123,456,456,123,456,123,987,987,123,987)
alt.d.f <- data.frame(ID,Basket)
Upvotes: 4
Views: 6630
Reputation: 73
A slightly more readable solution using the purrr & dplyr libraries (and the magrittr pipe operator) would be:
library(dplyr)
library(purrr)
d.f %>% filter(map_lgl(Basket_List,contains,as.integer(123)))
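For the "123 & 987" case mentioned in the question, one possible variation (a sketch using a purrr lambda instead of contains, not part of the original answer) would be:
d.f %>% filter(map_lgl(Basket_List, ~ all(c(123, 987) %in% .x)))  # keep rows whose basket holds both values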
Upvotes: 1
Reputation: 59980
A similar approach to @AriB's is to use the any operator, applying across rows with apply, like so:
d.f[ apply( d.f , 1 , function(x) any(unlist(x) %in% 123) ) , ]
# ID Basket_List
#2 2 123, 987
#3 3 123, 123
#5 5 456, 123
#7 7 123, 987
#8 8 987, 123
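For the "123 & 987" condition, a sketch along the same lines (note that unlist(x) also picks up the ID value, so this assumes no ID collides with the searched values):
d.f[ apply( d.f , 1 , function(x) all( c(123, 987) %in% unlist(x) ) ) , ]
# returns rows 2, 7 and 8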
With the second setup of your data I imagine that it would be very fast, because you could simply subset like so (using your alt.d.f, renamed df here):
df <- alt.d.f
df[ df$Basket %in% 123 , ]
# ID Basket
#2 2 123
#4 3 123
#5 3 123
#8 5 123
#10 7 123
#13 8 123
And if you only want the first instance of a row that contains the Basket value, you can subsequently use match with the unique IDs, as match returns the first match of its first argument in its second:
df2 <- df[ df$Basket %in% 123 , ]
df2[ match( unique(df2$ID) , df2$ID),]
# ID Basket
#2 2 123
#4 3 123
#8 5 123
#10 7 123
#13 8 123
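For a combined condition like "123 & 987", one rough sketch on this long format (my addition, following the same %in% idea) is to take the IDs that occur with both values and subset on those:
ids <- intersect( df$ID[ df$Basket %in% 123 ] , df$ID[ df$Basket %in% 987 ] )
df[ df$ID %in% ids , ]  # all rows for IDs 2, 7 and 8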
The second setup of your data will be far faster than the first I think. In fact, let's do a rough benchmark with it on a 1 million row table:
DF <- data.frame( ID = sample(ID , 1e6 , replace = TRUE) , Basket = sample(Basket , 1e6 , replace = TRUE) )
df <- DF
system.time({
df2 <- df[ df$Basket %in% 123 , ]
df2[ match( unique(df2$ID) , df2$ID),]
})
# user system elapsed
# 0.16 0.00 0.16
nrow(df)
#[1] 1000000
nrow(df2)
#[1] 428187
Upvotes: 4
Reputation: 72739
You can use sapply for this:
ID <- c(1,2,3,4,5,6,7,8,9)
Basket_List <- list(integer(0),c(123,987),c(123,123),456,
c(456,123),456,c(123,987),c(987,123),987)
d.f <- data.frame(ID)
sel <- sapply( Basket_List, function(bl, searchItem) {
  any(searchItem %in% bl)
}, searchItem = c(123) )
> sel
[1] FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE
> d.f[sel,,drop=FALSE]
ID
2 2
3 3
5 5
7 7
8 8
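For the "123 & 987" case, a small tweak of the same idea (swapping any for all, so every searched value must be present) would be:
sel_both <- sapply( Basket_List, function(bl, searchItem) {
  all(searchItem %in% bl)
}, searchItem = c(123, 987) )
d.f[sel_both, , drop = FALSE]  # rows 2, 7 and 8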
Please be careful with your terminology. A data.frame is not a matrix. It's a type of list.
Speed-wise, sapply is not the fastest, but the selection will be very fast since it is vectorized. If you need more speed, it's data.table time.
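A minimal sketch of what that could look like on the long-format data (object and column names assumed from the question, not benchmarked here):
library(data.table)
dt <- as.data.table(alt.d.f)  # long format from the question
dt[ Basket %in% 123 ]  # rows whose Basket is 123
dt[ , if (all(c(123, 987) %in% Basket)) .SD , by = ID ]  # IDs whose baskets hold both 123 and 987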
Upvotes: 7