Reputation: 11
I have a large data frame of genomic data for many different tumor samples. One column, "mutation", reports the specific variant of a protein detected in the sample. Two additional columns, "allele1" and "allele2", report the HLA type associated with each sample (each sample has two HLA values because there are two copies of it in the genome). I would like to count the samples for each unique combination of "mutation" and "allele1" OR "allele2" without counting duplicates (i.e. if a sample contained "mutation" mut1, "allele1" a2 and "allele2" a2, it should be counted only once).
df <- data.frame(mutation = c("mut1", "mut1"), allele1 = c("a1", "a2"), allele2 = c("a2", "a2"))
mutation allele1 allele2
mut1     a1      a2
mut1     a2      a2
I know I can use ddply (from the plyr package) in the following way:
qualities <- c("mutation", "allele1")
countedCombos <- ddply(df, qualities, nrow)
But how can I add a third column ("allele2") to my qualities parameter so that it is combined with "allele1" in an OR fashion? Running two separate analyses ("mutation" with "allele1", then "mutation" with "allele2") and summing the counts doesn't work, because samples that have the same value for "allele1" and "allele2" would be double counted.
Hope this is clear, tried to make it as generalizable as possible.
Thanks in advance!
My expected output for the sample data would be
df_count <- data.frame(mutation = c("mut1", "mut1"), allele = c("a1", "a2"), count = c(1, 2))
mutation allele count
mut1     a1     1
mut1     a2     2
Edit: Thanks for the help. Unfortunately, both of these solutions still seem to double count samples with the same allele1 and allele2 values. For example, enlarging the dataset and then recounting:
df <- data.frame(mutation = c("mut1", "mut1", "mut1", "mut2"), allele1 = c("a1", "a2", "a2", "a2"), allele2 = c("a2", "a2", "a2", "a2"))
> df
  mutation allele1 allele2
1     mut1      a1      a2
2     mut1      a2      a2
3     mut1      a2      a2
4     mut2      a2      a2
library(tidyverse)
df %>%
  pivot_longer(-mutation) %>%
  distinct() %>%
  count(mutation, value)
# A tibble: 3 × 3
  mutation value     n
  <chr>    <chr> <int>
1 mut1     a1        1
2 mut1     a2        2
3 mut2     a2        2
However, my desired output would be:
# Desired output:
  mutation value     n
  <chr>    <chr> <int>
1 mut1     a1        1
2 mut1     a2        2
3 mut2     a2        1
Upvotes: 1
Views: 65
Reputation: 33802
I think Jared did most of the work, but this small alteration generates the output shown in the question:
library(tidyverse)
df %>%
  pivot_longer(-mutation) %>%
  distinct() %>%
  count(mutation, value)
Result:
# A tibble: 2 × 3
  mutation value     n
  <chr>    <chr> <int>
1 mut1     a1        1
2 mut1     a2        2
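A note for anyone applying this to larger data: `distinct()` here runs on all three columns (`mutation`, `name`, `value`), so the `name` column keeps an `allele1` row and an `allele2` row apart even when they hold the same allele. The sketch below only inspects the intermediate table for the four-row data from the question's edit. It is meant as a diagnostic, not a fix, and the object name `long_distinct` is just an illustrative choice.

```r
library(dplyr)
library(tidyr)

# Four-row data taken from the question's edit
df <- data.frame(
  mutation = c("mut1", "mut1", "mut1", "mut2"),
  allele1  = c("a1", "a2", "a2", "a2"),
  allele2  = c("a2", "a2", "a2", "a2")
)

# Same first two steps as above; stop before count() to see what is left
long_distinct <- df %>%
  pivot_longer(-mutation) %>%  # default output columns: name, value
  distinct()                   # de-duplicates on mutation + name + value

long_distinct
# 5 rows remain. mut2 still has two rows, (allele1, a2) and (allele2, a2),
# which is why count(mutation, value) reports 2 for mut2 / a2.
```

Because rows are compared across the whole table rather than within a sample, the result reflects distinct column/allele pairs per mutation, which happens to match the expected output for the two-row example.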
Upvotes: 1
Reputation: 26715
Thank you for editing your question to include the desired output; that makes more sense. Here is one potential solution:
library(tidyverse)
df <- data.frame(mutation = c("mut1", "mut1"),
                 allele1 = c("a1", "a2"),
                 allele2 = c("a2", "a2"))
df_count <- data.frame(mutation = c("mut1", "mut1"),
                       allele = c("a1", "a2"),
                       count = c(1, 2))
df_count
#>   mutation allele count
#> 1     mut1     a1     1
#> 2     mut1     a2     2
df %>%
  pivot_longer(-mutation, values_to = "allele") %>%
  distinct() %>%
  group_by(mutation, allele) %>%
  tally(name = "count")
#> # A tibble: 2 × 3
#> # Groups:   mutation [1]
#>   mutation allele count
#>   <chr>    <chr>  <int>
#> 1 mut1     a1         1
#> 2 mut1     a2         2
Created on 2022-09-13 by the reprex package (v2.0.1)
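If the goal is to count each sample at most once per mutation/allele pair, one possible variation is to de-duplicate within a sample rather than across the whole table. The sketch below assumes that every row is one sample and adds a hypothetical `sample_id` column from the row number (that column is not in the original data). It uses the four-row data from the question's edit.

```r
library(dplyr)
library(tidyr)

# Four-row data taken from the question's edit
df <- data.frame(
  mutation = c("mut1", "mut1", "mut1", "mut2"),
  allele1  = c("a1", "a2", "a2", "a2"),
  allele2  = c("a2", "a2", "a2", "a2")
)

per_sample_counts <- df %>%
  # Assumption: one row = one sample; sample_id is a made-up helper column
  mutate(sample_id = row_number()) %>%
  pivot_longer(c(allele1, allele2), values_to = "allele") %>%
  # Keep one row per sample / mutation / allele, so a2/a2 samples count once
  distinct(sample_id, mutation, allele) %>%
  count(mutation, allele, name = "count")

per_sample_counts
# mut1 a1 -> 1, mut1 a2 -> 3, mut2 a2 -> 1
```

This gives 1 for mut2/a2, as requested in the edit. Note that mut1/a2 comes out as 3 because three samples carry a2, whereas the desired output in the question shows 2, so this may not be exactly what was intended.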
Upvotes: 1