Reputation: 651
For instance, imagine that you want to count all identical residues in an 80-residue peptide, where a match means that the same residue occurs at the same position in another peptide. The catch is that the number of levels is probably not the same, as some of the letters [A - Z] that represent residues will be present in one peptide but not in the next. For simplicity, imagine that we are looking for exactly identical residues (the letters match at the same positions) in all three peptides, so the answer at each position is a Boolean TRUE or FALSE: TRUE if they all match and FALSE if they do not. Again, the catch is that the factors do not have the same number of levels, so you can't test peptide_x == peptide_y.
Coding:
> peptide_x <- as.factor(sample(LETTERS[1:26], replace = TRUE, 80))
> peptide_y <- as.factor(sample(LETTERS[1:26], replace = TRUE, 80))
> peptide_z <- as.factor(sample(LETTERS[1:26], replace = TRUE, 80))
You can check which letters from the alphabet of 26 residues are missing in your peptide with the command:
> setdiff(LETTERS[1:26], peptide_x)
[1] "Y"
So we see that "Y" (Tyrosine) is missing. When you create the random peptide, you might be missing another letter or two, and you can do this for any of the peptides.
If I try to compare factors with equal levels, then that works:
> x <- c("M", "N", "A", "Q", "C")
> y <- c("N", "M", "A", "C", "Q")
> xy_frame <- data.frame(x,y)
> xy_frame
> x == y
[1] FALSE FALSE TRUE FALSE FALSE
As you can see, the A's match up, so the third element "A" is the only truth.
Shockingly this test works:
> x <- c("A", "A", "B", "Q", "C")
> y <- c("A", "Q", "C", "D", "R")
> x == y
[1] TRUE FALSE FALSE FALSE FALSE
even though the number of factors is not the same. So I wonder if there is something wrong with my data type which is why I can't test this:
> peptides <- data.frame(peptide_x, peptide_y)
> peptides$peptide_x == peptides$peptide_y
Error in Ops.factor(peptides$peptide_x, peptides$peptide_y) : level sets of factors are different
So how can I fix my data type if that's the issue, or am I running the right test?
I just want to count the TRUE and FALSE results of a position-by-position comparison when the factors have non-identical levels.
Comment:
Is the %in% not working correctly because ...
head(peptide_x)
[1] "C" "T" "X" "Z" "M" "A"
head(peptide_y)
[1] "R" "G" "T" "U" "G" "U"
head(peptide_x %in% peptide_y)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
The first 6 letters of each peptide, for example, don't match up, but it says TRUE! How?
Upvotes: 1
Views: 280
Reputation: 4844
Make all the levels exist in every factor, even if some of them aren't present in the data:
x <- factor(sample(LETTERS[1:26], replace = TRUE, 80), levels = LETTERS)
y <- factor(sample(LETTERS[1:26], replace = TRUE, 80), levels = LETTERS)
z <- factor(sample(LETTERS[1:26], replace = TRUE, 80), levels = LETTERS)
Note how I am setting the levels in each vector to be the same. It is OK if some of those levels never occur in the data.
> x==y
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[27] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[40] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[53] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[66] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[79] FALSE FALSE
> x==z
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[14] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[27] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[40] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[53] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[66] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[79] FALSE FALSE
> y==z
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
[14] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[27] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[40] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[53] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[66] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[79] FALSE FALSE
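Since the question asks for counts rather than the raw logical vectors, here is a minimal sketch of how the comparisons above could be tallied. The seed and the names matches_xy, n_match_xy, all_three and n_all_three are assumptions made for illustration only; they are not part of the original answer.

```r
# Sketch: count position-wise matches between factors that share one level set.
set.seed(42)  # assumption: seed chosen only so the example is reproducible

x <- factor(sample(LETTERS, 80, replace = TRUE), levels = LETTERS)
y <- factor(sample(LETTERS, 80, replace = TRUE), levels = LETTERS)
z <- factor(sample(LETTERS, 80, replace = TRUE), levels = LETTERS)

matches_xy <- x == y           # logical vector, one entry per position
n_match_xy <- sum(matches_xy)  # TRUE counts as 1, FALSE as 0
table(matches_xy)              # tally of FALSE and TRUE

# Positions where all three peptides carry the same residue
all_three   <- x == y & y == z
n_all_three <- sum(all_three)
which(all_three)               # indices of the fully matching positions
```

sum() works here because R coerces TRUE to 1 and FALSE to 0, and table() gives both counts at once.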
Alternatively, you can cast them to character vectors and compare those.
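A minimal sketch of the character-based comparison, reusing the two five-letter vectors from the question but stored as factors so that their level sets differ. The name same_residue is an assumption made for illustration.

```r
# Two factors whose level sets differ: {A, B, C, Q} versus {A, C, D, Q, R}
peptide_x <- factor(c("A", "A", "B", "Q", "C"))
peptide_y <- factor(c("A", "Q", "C", "D", "R"))

# Comparing the factors directly would stop with
# "level sets of factors are different", so convert to character first.
same_residue <- as.character(peptide_x) == as.character(peptide_y)
same_residue        # [1]  TRUE FALSE FALSE FALSE FALSE

sum(same_residue)   # number of matching positions: 1
sum(!same_residue)  # number of non-matching positions: 4
```

This leaves the original factors untouched; only the comparison is done on character data.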
Upvotes: 2
Reputation: 7469
In reference to this code:
x <- c("A", "A", "B", "Q", "C")
y <- c("A", "Q", "C", "D", "R")
R> x == y
[1] TRUE FALSE FALSE FALSE FALSE
This works because you are comparing character vector x to character vector y. I would just skip the factors and use a similar test with the %in% operator:
R> peptide_x <- sample(LETTERS[1:26], replace = TRUE, 80)
R> peptide_y <- sample(LETTERS[1:26], replace = TRUE, 80)
R> peptide_x %in% peptide_y
[1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
[20] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[39] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[58] TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[77] TRUE TRUE TRUE FALSE
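One caveat worth spelling out: %in% tests whether each element of the left-hand vector occurs anywhere in the right-hand vector, not whether the two vectors agree position by position. With 80 random letters nearly every letter appears somewhere in the other peptide, which would explain the mostly-TRUE output above and the behaviour described in the comment on the question. A small sketch using the first six letters quoted in that comment (the names px and py are assumptions made for illustration):

```r
# First six residues of each peptide, as quoted in the comment
px <- c("C", "T", "X", "Z", "M", "A")
py <- c("R", "G", "T", "U", "G", "U")

# Set membership: is each letter of px found anywhere in py?
px %in% py   # [1] FALSE  TRUE FALSE FALSE FALSE FALSE  ("T" occurs in py)

# Position-wise comparison: same letter at the same index?
px == py     # [1] FALSE FALSE FALSE FALSE FALSE FALSE

sum(px == py)  # number of positions that match: 0
```

For counting residues that are identical at the same position, == on character vectors appears to be the test that matches the question.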
Upvotes: 1