Mike Gormally

Reputation: 11

Counting unique combinations of values in 1 column and 2 additional columns without duplicates

I have a large df of genomic data for many different tumor samples. One column, "mutation", reports the specific variant of a protein detected in the sample. Two additional columns, "allele1" and "allele2", report the HLA type associated with each sample (each sample has two HLA values because there are two copies of this in the genome). I would like to count the samples with each unique combination of "mutation" and "allele1" OR "allele2" without counting duplicates (i.e. if a sample contained "mutation" mut1, "allele1" a2 and "allele2" a2, it should be counted only once).

    df <- data.frame(mutation = c("mut1", "mut1"), allele1 = c("a1", "a2"), allele2 = c("a2", "a2"))

mutation allele1 allele2
mut1     a1      a2
mut1     a2      a2    

I know I can use ddply in the following way:

library(plyr)  # provides ddply

qualities <- c("mutation", "allele1")
countedCombos <- ddply(df, qualities, nrow)

But how can I add a third column ("allele2") to my qualities parameter so that it is combined with "allele1" in an OR fashion? Running two separate analyses, one with "mutation" and "allele1" and one with "mutation" and "allele2", and then summing the counts doesn't work, because samples that have the same value for "allele1" and "allele2" would be counted twice.

I hope this is clear; I tried to make it as generalizable as possible.

Thanks in advance!

My expected output for the sample data would be

df_count <- data.frame(mutation = c("mut1", "mut1"), allele = c("a1", "a2"), count = c(1, 2))

mutation allele count
mut1     a1     1
mut1     a2     2
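For reference, here is a minimal sketch of one way the OR-style count could be built with ddply. It assumes that each row is one sample and that neither allele column contains NA; the helper names `differs` and `long` are made up for illustration. The idea is to stack the two allele columns into one, but to keep the "allele2" value only when it differs from "allele1", so a sample with the same allele twice contributes a single row.

```r
library(plyr)

df <- data.frame(mutation = c("mut1", "mut1"),
                 allele1 = c("a1", "a2"),
                 allele2 = c("a2", "a2"))

# TRUE where the two alleles differ; homozygous samples are FALSE
# (assumption: no NA values in either allele column)
differs <- df$allele1 != df$allele2

# Stack allele1 for every sample, and allele2 only where it adds a new value
long <- rbind(
  data.frame(mutation = df$mutation, allele = df$allele1),
  data.frame(mutation = df$mutation[differs], allele = df$allele2[differs])
)

# Same ddply call as before, now on the stacked data
countedCombos <- ddply(long, c("mutation", "allele"), nrow)
countedCombos
#>   mutation allele V1
#> 1     mut1     a1  1
#> 2     mut1     a2  2
```

Note that ddply names the count column V1 by default, and that every row counts as a separate sample here, so identical rows are each counted.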

Edit: Thanks for the help. Unfortunately, both of these solutions still seem to double count samples with the same allele1 and allele2 values. For example, enlarging the dataset and then recounting:

df <- data.frame(mutation = c("mut1", "mut1", "mut1", "mut2"), allele1 = c("a1", "a2", "a2", "a2"), allele2 = c("a2", "a2", "a2", "a2"))  

> df
  mutation allele1 allele2
1     mut1      a1      a2
2     mut1      a2      a2
3     mut1      a2      a2
4     mut2      a2      a2


df %>%
  pivot_longer(-mutation) %>%
  distinct() %>%
  count(mutation, value)

# A tibble: 3 × 3
  mutation value     n
  <chr>    <chr> <int>
1 mut1     a1        1
2 mut1     a2        2
3 mut2     a2        2

However, my desired output would be:

# Desired output:
  mutation value     n
  <chr>    <chr> <int>
1 mut1     a1        1
2 mut1     a2        2
3 mut2     a2        1
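A possible way to reach this output is sketched below. It rests on an assumption that is not stated explicitly above: identical rows (such as rows 2 and 3) are treated as duplicates and collapsed before counting, which is the only reading under which mut1/a2 comes out as 2 rather than 3. The `sample_id` column is a hypothetical helper added for illustration; it lets `distinct()` remove a repeated allele within one row without merging different rows.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  mutation = c("mut1", "mut1", "mut1", "mut2"),
  allele1  = c("a1", "a2", "a2", "a2"),
  allele2  = c("a2", "a2", "a2", "a2")
)

result <- df %>%
  distinct() %>%                           # assumption: identical rows are duplicates
  mutate(sample_id = row_number()) %>%     # hypothetical per-row identifier
  pivot_longer(c(allele1, allele2), values_to = "allele") %>%
  distinct(sample_id, mutation, allele) %>%  # one row per allele per sample
  count(mutation, allele)

result
#>   mutation allele n
#> 1 mut1     a1     1
#> 2 mut1     a2     2
#> 3 mut2     a2     1
```

If every row should instead count as its own sample, dropping the first `distinct()` would keep rows 2 and 3 separate, and the mut1/a2 count would become 3.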

Upvotes: 1

Views: 65

Answers (2)

neilfws

Reputation: 33802

I think Jared did most of the work, but this small alteration generates the output shown in the question:

library(tidyverse)

df %>% 
  pivot_longer(-mutation) %>% 
  distinct() %>% 
  count(mutation, value)

Result:

# A tibble: 2 × 3
  mutation value     n
  <chr>    <chr> <int>
1 mut1     a1        1
2 mut1     a2        2
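One thing worth checking before applying this to larger data: `pivot_longer()` keeps a `name` column recording which allele column each value came from, so a plain `distinct()` does not collapse a sample that carries the same allele twice. The small sketch below illustrates this with a single homozygous sample taken from the enlarged data in the question; the object names `homozygous` and `long` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# One sample with the same value in both allele columns
homozygous <- data.frame(mutation = "mut2", allele1 = "a2", allele2 = "a2")

long <- homozygous %>% pivot_longer(-mutation)

# The two long rows differ in `name` (allele1 vs allele2), so both survive
n_all_columns <- nrow(distinct(long))                # 2

# Ignoring `name` collapses them into a single row
n_without_name <- nrow(distinct(long, mutation, value))  # 1
```

On its own, dropping `name` is not a full fix for multi-row data, because it would also merge different samples that share a mutation and allele; a per-sample identifier would be needed to keep those apart.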

Upvotes: 1

jared_mamrot

Reputation: 26715

Thank you for editing your question to include the desired output; that makes more sense. Here is one potential solution:

library(tidyverse)

df <- data.frame(mutation = c("mut1", "mut1"),
                 allele1 = c("a1", "a2"),
                 allele2 = c("a2", "a2"))

df_count <- data.frame(mutation = c("mut1", "mut1"),
                       allele = c("a1", "a2"),
                       count = c(1, 2))

df_count
#>   mutation allele count
#> 1     mut1     a1     1
#> 2     mut1     a2     2

df %>%
  pivot_longer(-mutation, values_to = "allele") %>%
  distinct() %>%
  group_by(mutation, allele) %>%
  tally(name = "count")
#> # A tibble: 2 × 3
#> # Groups:   mutation [1]
#>   mutation allele count
#>   <chr>    <chr>  <int>
#> 1 mut1     a1         1
#> 2 mut1     a2         2

Created on 2022-09-13 by the reprex package (v2.0.1)

Upvotes: 1
