Crissy Widjaja
Crissy Widjaja

Reputation: 13

compare values of one column with values of multiple other columns

I have one column (VariableA) of unique IDs that I want to compare with multiple columns (VariableB, VariableC...).

Each column has unique IDs but an ID may be in two or more columns like so:

VariableA VariableB VariableC 0001 0001 0008 0002 0003 0001 0004 0004 0002 0005 0006 0007 NA

I was using ainb <- variablea[,1] %in% variableb[,1] to check whether or not values in VariableA existed in VariableB, which worked, but I cannot search figure out how to get a count of values in VariableA that exist in more than one column (VariableB, VariableC, etc).

Using ainall <- variablea[,1] %in% dat[,2:6] just gave me all FALSE. I believe the problem may be that the length of the columns are all different. I'm not trying to match by row, just each individual value.

Hope that's clear enough! Any help appreciated.

Upvotes: 1

Views: 1869

Answers (1)

Lizzie Silver
Lizzie Silver

Reputation: 293

Here's a quick example. First, let's create an example data frame:

data_example <- data.frame(var_a = 1:10, var_b = c(1:5,11:15), var_c=c(1:7, NA, NA, NA))
> data_example
   var_a var_b var_c
1      1     1     1
2      2     2     2
3      3     3     3
4      4     4     4
5      5     5     5
6      6    11     6
7      7    12     7
8      8    13    NA
9      9    14    NA
10    10    15    NA

So you can see that var_a overlaps with var_b by 50%, and var_c by 70% (and var_c has some NAs).

Let's see which unique values overlap, for each column:

overlap_vals = apply(data_example, 2, intersect, data_example$var_a)
> overlap_vals
$var_a
 [1]  1  2  3  4  5  6  7  8  9 10

$var_b
[1] 1 2 3 4 5

$var_c
[1] 1 2 3 4 5 6 7

Now we want to know how many elements of var_a appear in these overlap vectors. Start by creating a little helper function that gets the proportion overlap for one vector in the list:

propn_overlap <- function(comparison_vector, id_vector){ 
                     sum(id_vector %in% comparison_vector) / length(id_vector)
                 }

Now we can use sapply to apply our helper function over all the vectors in the list:

> sapply(overlap_vals, propn_overlap, data_example$var_a)
var_a var_b var_c 
  1.0   0.5   0.7 

Upvotes: 2

Related Questions