Reputation: 11
I have a dataframe filled with names.
For a given row in the dataframe, I'd like to compare that row to every row above it in the df and determine if the number of matching names is less than or equal to 4 for every row.
Toy Example where row 3 is the row of interest
"Jim","Dwight","Michael","Andy","Stanley","Creed"
"Jim","Dwight","Angela","Pam","Ryan","Jan"
"Jim","Dwight","Angela","Pam","Creed","Ryan" <--- row of interest
So first we'd compare row 3 to row 1 and see that the name overlap is 3, which meets the <= 4 criteria.
Then we'd compare row 3 to row 2 and see that the name overlap is 5 which fails the <= 4 criteria, ultimately returning a failed condition for being <=4 for every row above it.
Right now I am doing this operation using a for loop but the speed is much too slow for the dataframe size I am working with.
Upvotes: 0
Views: 118
Reputation: 3412
Similar solution as IceCreamToucan, but for any row.
For the data.frame:
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
)
For any row number i:
f <- function(i) {
if(i == 1) return(T)
r <- vapply(df[1:(i-1),], '%in%', unlist(df[i,]), FUN.VALUE = logical(i-1))
out_lgl <- rowSums(as.matrix(r)) <= 4
return(all(out_lgl))
}
Upvotes: 0
Reputation: 28685
Example data
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
df
# V1 V2 V3 V4 V5 V6
# 1 Jim Dwight Michael Andy Stanley Creed
# 2 Jim Dwight Angela Pam Ryan Jan
# 3 Jim Dwight Angela Pam Creed Ryan
Operation and output (sapply over columns with %in%
and take rowSums
)
out_lgl <- rowSums(sapply(df, '%in%', unlist(df[3,]))) <= 4
out_lgl
# [1] TRUE FALSE FALSE
which(out_lgl)
# [1] 1
Explanation:
For each column, each element is compared to the third row (the vector unlist(df[3,])
). The output is a matrix of logical values with the same dimensions as df
, TRUE
if there is a match.
sapply(df, '%in%', unlist(df[3,]))
# V1 V2 V3 V4 V5 V6
# [1,] TRUE TRUE FALSE FALSE FALSE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE
Then we can sum the TRUE
s to see the number of matches for each row
rowSums(sapply(df, '%in%', unlist(df[3,])))
# [1] 3 5 6
Edit:
I have added the stringsAsFactors = FALSE
option to the creation of df
above. However, as far as I can tell the output of %in%
is the same whether comparing factors with different levels or characters, so I don't believe this could change the results in any way. See example below
x <- c('b', 'c', 'z')
y <- c('a', 'b', 'g')
all.equal(x %in% y, factor(x) %in% factor(y))
# [1] TRUE
Upvotes: 1