xyz123

Reputation: 651

How can I test factors for equality when they have an unequal number of levels?

For instance, imagine that you want to count all identical residues in an 80-residue peptide, where a match occurs when the same residue occurs at the same position in another peptide. The catch is that the number of levels is probably not the same, as some of the letters [A - Z] that represent residues will be present in one peptide but not in the next. For simplicity, imagine that we are looking for exactly identical residues (the letters match at the same positions) in all three peptides, so the answer at each position is a Boolean TRUE or FALSE: TRUE if they all match and FALSE if they do not. Again, the catch is that the level sets are not the same, so you can't test peptide_x == peptide_y.

Coding:

> peptide_x <- as.factor(sample(LETTERS[1:26], replace = TRUE, 80))
> peptide_y <- as.factor(sample(LETTERS[1:26], replace = TRUE, 80))
> peptide_z <- as.factor(sample(LETTERS[1:26], replace = TRUE, 80))

You can check which letters from the alphabet of 26 residues are missing in your peptide with the command:

> setdiff(LETTERS[1:26], peptide_x)

[1] "Y"

So we see that "Y" (Tyrosine) is missing. When you create the random peptide, you might be missing another letter or two, and you can do this for any of the peptides.

If I try to compare factors with equal levels, then that works:

> x <- c("M", "N", "A", "Q", "C")
> y <- c("N", "M", "A", "C", "Q") 
> xy_frame <- data.frame(x,y)
> xy_frame
> x == y

[1] FALSE FALSE  TRUE FALSE FALSE

As you can see, the A's match up, so the third element "A" is the only TRUE.

Shockingly this test works:

> x <- c("A", "A", "B", "Q", "C")
> y <- c("A", "Q", "C", "D", "R")
> x == y
[1]  TRUE FALSE FALSE FALSE FALSE

even though the number of levels is not the same. So I wonder whether something is wrong with my data type, and whether that is why I can't run this test:

> peptides <- data.frame(peptide_x, peptide_y)
> peptides$peptide_x == peptides$peptide_y

Error in Ops.factor(peptides$peptide_x, peptides$peptide_y) : level sets of factors are different

So how can I fix my data type if that's the issue, or am I running the right test?

I just want to count TRUE - FALSE for non-identical factor levels.

Comment:

Is the %in% not working correctly because ...

> head(peptide_x)
[1] "C" "T" "X" "Z" "M" "A"

> head(peptide_y)
[1] "R" "G" "T" "U" "G" "U"

> head(peptide_x %in% peptide_y)
[1] TRUE TRUE TRUE TRUE TRUE TRUE

The first 6 letters of each peptide, for example, don't match up, but it says TRUE! How?

Upvotes: 1

Views: 280

Answers (2)

T. Scharf

Reputation: 4844

Make all the levels exist in every factor, even if some of them aren't present in the data:

x <- factor(sample(LETTERS[1:26], replace = TRUE, 80), levels = LETTERS) 
y <- factor(sample(LETTERS[1:26], replace = TRUE, 80), levels = LETTERS) 
z <- factor(sample(LETTERS[1:26], replace = TRUE, 80), levels = LETTERS)

Note that I am setting the same levels for each vector. It is fine if some of those levels never occur in the data.

> x==y
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[27] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[40] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[53]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[66] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[79] FALSE FALSE
> x==z
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[27] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[40] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[53] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[66] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[79] FALSE FALSE
> y==z
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
[14] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[27] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[40] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[53] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[66] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[79] FALSE FALSE
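
If the factors already exist with differing level sets, as in the question, the same idea can be applied after the fact. The following is only a sketch under a few assumptions: the seed is arbitrary, and the names all_levels, px, py, pz, n_xy and all_three are made up for illustration. It re-levels each factor to the union of all levels and then counts matching positions, since TRUE counts as 1 in sum().

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Factors built as in the question: each one keeps only the letters it drew
peptide_x <- as.factor(sample(LETTERS, 80, replace = TRUE))
peptide_y <- as.factor(sample(LETTERS, 80, replace = TRUE))
peptide_z <- as.factor(sample(LETTERS, 80, replace = TRUE))

# Union of every level seen in any of the three factors
all_levels <- Reduce(union, list(levels(peptide_x),
                                 levels(peptide_y),
                                 levels(peptide_z)))

# Re-create each factor on the shared level set
px <- factor(peptide_x, levels = all_levels)
py <- factor(peptide_y, levels = all_levels)
pz <- factor(peptide_z, levels = all_levels)

# Number of positions where x and y carry the same residue
n_xy <- sum(px == py)

# Positions where all three peptides carry the same residue
all_three <- px == py & py == pz
sum(all_three)    # how many positions match in all three
which(all_three)  # which positions those are
```

Passing levels = LETTERS instead of all_levels would work just as well here; the union is only useful when the full alphabet is not known in advance.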

Alternatively, you can cast them to character vectors and compare those.
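
A minimal sketch of the character-vector route, reusing the five-letter vectors from the question but stored as factors (the name matches is made up for illustration):

```r
# Two factors whose level sets differ: x has A, B, C, Q; y has A, C, D, Q, R
x <- factor(c("A", "A", "B", "Q", "C"))
y <- factor(c("A", "Q", "C", "D", "R"))

# x == y would stop with "level sets of factors are different";
# converting to character compares the labels position by position
matches <- as.character(x) == as.character(y)
matches       # TRUE FALSE FALSE FALSE FALSE
sum(matches)  # 1
```

This avoids touching the levels at all, at the cost of creating temporary character copies of both vectors.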

Upvotes: 2

Stedy

Reputation: 7469

In reference to this code:

x <- c("A", "A", "B", "Q", "C")
y <- c("A", "Q", "C", "D", "R")
R> x == y
[1]  TRUE FALSE FALSE FALSE FALSE

This works because you are comparing character vector x to character vector y. I would just skip the factors and use a similar test with the %in% operator:

R> peptide_x <- sample(LETTERS[1:26], replace = TRUE, 80)
R> peptide_y <- sample(LETTERS[1:26], replace = TRUE, 80)

R> peptide_x %in% peptide_y
 [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
[20]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
[39]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[58]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
[77]  TRUE  TRUE  TRUE FALSE
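
Keep in mind that %in% and == answer different questions. %in% asks whether each element of the left vector occurs anywhere in the right vector, which is why almost every entry above is TRUE: with 80 draws from 26 letters, nearly every letter appears somewhere. For a position-by-position match, == on the character vectors is the test to use. A small sketch using only the first six residues shown in the question's comment (px and py are illustrative names):

```r
# First six residues of each peptide, copied from the question's comment
px <- c("C", "T", "X", "Z", "M", "A")
py <- c("R", "G", "T", "U", "G", "U")

# Membership: is each element of px found anywhere in py?
px %in% py     # FALSE  TRUE FALSE FALSE FALSE FALSE

# Positional equality: does px[i] equal py[i]?
px == py       # FALSE FALSE FALSE FALSE FALSE FALSE
sum(px == py)  # 0
```

On this six-letter subset only "T" is found in py. In the comment, the membership test ran against the full 80-residue peptide_y, which evidently contained every one of those six letters somewhere, so all six came back TRUE.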

Upvotes: 1
