Reputation: 13
I have one column (VariableA) of unique IDs that I want to compare with multiple columns (VariableB, VariableC...).
Each column has unique IDs but an ID may be in two or more columns like so:
VariableA VariableB VariableC
0001 0001 0008
0002 0003 0001
0004 0004 0002
0005 0006
0007 NA
I was using ainb <- variablea[,1] %in% variableb[,1]
to check whether or not values in VariableA existed in VariableB, which worked, but I cannot search figure out how to get a count of values in VariableA that exist in more than one column (VariableB, VariableC, etc).
Using ainall <- variablea[,1] %in% dat[,2:6]
just gave me all FALSE. I believe the problem may be that the length of the columns are all different. I'm not trying to match by row, just each individual value.
Hope that's clear enough! Any help appreciated.
Upvotes: 1
Views: 1869
Reputation: 293
Here's a quick example. First, let's create an example data frame:
data_example <- data.frame(var_a = 1:10, var_b = c(1:5,11:15), var_c=c(1:7, NA, NA, NA))
> data_example
var_a var_b var_c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
6 6 11 6
7 7 12 7
8 8 13 NA
9 9 14 NA
10 10 15 NA
So you can see that var_a
overlaps with var_b
by 50%, and var_c
by 70% (and var_c
has some NAs).
Let's see which unique values overlap, for each column:
overlap_vals = apply(data_example, 2, intersect, data_example$var_a)
> overlap_vals
$var_a
[1] 1 2 3 4 5 6 7 8 9 10
$var_b
[1] 1 2 3 4 5
$var_c
[1] 1 2 3 4 5 6 7
Now we want to know how many elements of var_a
appear in these overlap vectors. Start by creating a little helper function that gets the proportion overlap for one vector in the list:
propn_overlap <- function(comparison_vector, id_vector){
sum(id_vector %in% comparison_vector) / length(id_vector)
}
Now we can use sapply to apply our helper function over all the vectors in the list:
> sapply(overlap_vals, propn_overlap, data_example$var_a)
var_a var_b var_c
1.0 0.5 0.7
Upvotes: 2