madscout

Reputation: 11

R: Test for overlap of name values in dataframe

I have a dataframe filled with names.

For a given row in the data frame, I'd like to compare that row to every row above it and determine whether the number of matching names is less than or equal to 4 for each of those rows.

Toy Example where row 3 is the row of interest

  1. "Jim","Dwight","Michael","Andy","Stanley","Creed"

  2. "Jim","Dwight","Angela","Pam","Ryan","Jan"

  3. "Jim","Dwight","Angela","Pam","Creed","Ryan" <--- row of interest

So first we'd compare row 3 to row 1 and see that the name overlap is 3, which meets the <= 4 criterion.

Then we'd compare row 3 to row 2 and see that the name overlap is 5, which fails the <= 4 criterion. Because the overlap must be <= 4 for every row above it, the overall check for row 3 fails.
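The two comparisons described above can be reproduced directly with `intersect()`. This is only a sketch of the toy example; the variable names `row1`, `row2`, `row3`, `overlap_13`, and `overlap_23` are made up for illustration, and it assumes names are not repeated within a row.

```r
# The three toy rows from the question
row1 <- c("Jim", "Dwight", "Michael", "Andy", "Stanley", "Creed")
row2 <- c("Jim", "Dwight", "Angela", "Pam", "Ryan", "Jan")
row3 <- c("Jim", "Dwight", "Angela", "Pam", "Creed", "Ryan")  # row of interest

# Number of names row 3 shares with each row above it
overlap_13 <- length(intersect(row3, row1))  # Jim, Dwight, Creed
overlap_23 <- length(intersect(row3, row2))  # Jim, Dwight, Angela, Pam, Ryan

# The condition holds only if every overlap is <= 4
row3_ok <- all(c(overlap_13, overlap_23) <= 4)
row3_ok
# [1] FALSE
```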

Right now I do this with a for loop, but it is much too slow for the size of the data frame I am working with.

Upvotes: 0

Views: 118

Answers (2)

SmokeyShakers

Reputation: 3412

Similar to IceCreamToucan's solution, but generalized to any row.

For the data.frame:

df <- as.data.frame(rbind(
  c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
  c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
  c("Jim","Dwight","Angela","Pam","Creed","Ryan")
))

For any row number i:

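f <- function(i) {
  # The first row has no rows above it, so the condition holds trivially
  if (i == 1) return(TRUE)
  # For each column, flag names in rows 1..(i-1) that also appear in row i
  r <- vapply(df[seq_len(i - 1), ], '%in%', unlist(df[i, ]), FUN.VALUE = logical(i - 1))
  # vapply returns a plain vector when i == 2, so rebuild an (i-1)-row matrix
  out_lgl <- rowSums(matrix(r, nrow = i - 1)) <= 4
  all(out_lgl)
}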
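If the check is needed for every row rather than for a single i, one possible alternative is to compute all pairwise overlaps at once from a 0/1 incidence matrix. The following is a rough sketch, not something taken from either answer: the helper name `passes_all_above` and its `max_overlap` argument are hypothetical, it assumes names are not repeated within a row, and it builds an n-by-n matrix, so it may not fit in memory for very large data frames.

```r
# Toy data from the question
df <- data.frame(rbind(
  c("Jim", "Dwight", "Michael", "Andy", "Stanley", "Creed"),
  c("Jim", "Dwight", "Angela", "Pam", "Ryan", "Jan"),
  c("Jim", "Dwight", "Angela", "Pam", "Creed", "Ryan")
), stringsAsFactors = FALSE)

# Hypothetical helper: for each row, TRUE if it shares at most
# `max_overlap` names with every row above it
passes_all_above <- function(df, max_overlap = 4) {
  m <- as.matrix(df)
  names_all <- unique(as.vector(m))
  # Incidence matrix: one row per df row, one column per distinct name
  inc <- matrix(0, nrow = nrow(m), ncol = length(names_all))
  inc[cbind(as.vector(row(m)), match(as.vector(m), names_all))] <- 1
  # overlap[i, j] = number of distinct names shared by rows i and j
  overlap <- tcrossprod(inc)
  # Keep only comparisons against earlier rows (j < i)
  overlap[upper.tri(overlap, diag = TRUE)] <- 0
  apply(overlap, 1, max) <= max_overlap
}

res <- passes_all_above(df)
res
# [1]  TRUE  TRUE FALSE
```

Row 3 fails because it shares 5 names with row 2; rows 1 and 2 pass (row 1 has nothing above it, and row 2 shares only 2 names with row 1).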

Upvotes: 0

IceCreamToucan

Reputation: 28685

Example data

df <- as.data.frame(rbind(
  c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
  c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
  c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)

df
#    V1     V2      V3   V4      V5    V6
# 1 Jim Dwight Michael Andy Stanley Creed
# 2 Jim Dwight  Angela  Pam    Ryan   Jan
# 3 Jim Dwight  Angela  Pam   Creed  Ryan

Operation and output (sapply over columns with %in% and take rowSums)

out_lgl <- rowSums(sapply(df, '%in%', unlist(df[3,]))) <= 4

out_lgl
# [1]  TRUE FALSE FALSE
which(out_lgl)
# [1] 1

Explanation:

For each column, each element is checked for membership in the third row (the vector unlist(df[3,])). The output is a matrix of logical values with the same dimensions as df, with TRUE wherever there is a match.

sapply(df, '%in%', unlist(df[3,]))

#        V1   V2    V3    V4    V5    V6
# [1,] TRUE TRUE FALSE FALSE FALSE  TRUE
# [2,] TRUE TRUE  TRUE  TRUE  TRUE FALSE
# [3,] TRUE TRUE  TRUE  TRUE  TRUE  TRUE

Then we can sum the TRUEs to get the number of matches for each row:

rowSums(sapply(df, '%in%', unlist(df[3,])))
# [1] 3 5 6
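Note that these counts cover every row, including row 3 itself. Since the question only asks about the rows above the row of interest, the counts can be restricted to those rows. A small sketch under that reading, where `i` and `counts` are illustrative names:

```r
# Toy data from the question
df <- as.data.frame(rbind(
  c("Jim", "Dwight", "Michael", "Andy", "Stanley", "Creed"),
  c("Jim", "Dwight", "Angela", "Pam", "Ryan", "Jan"),
  c("Jim", "Dwight", "Angela", "Pam", "Creed", "Ryan")
), stringsAsFactors = FALSE)

i <- 3  # row of interest

# Matches with row i for every row of df
counts <- rowSums(sapply(df, `%in%`, unlist(df[i, ])))

# Keep only the rows above i and test the <= 4 condition on all of them
above_ok <- all(counts[seq_len(i - 1)] <= 4)
above_ok
# [1] FALSE
```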

Edit:

I have added the stringsAsFactors = FALSE option to the creation of df above. However, as far as I can tell, the output of %in% is the same whether it compares factors with different levels or character vectors, so I don't believe this changes the results. See the example below:

x <- c('b', 'c', 'z')
y <- c('a', 'b', 'g')

all.equal(x %in% y, factor(x) %in% factor(y))
# [1] TRUE

Upvotes: 1
