CelineDion
CelineDion

Reputation: 1088

How do I count the number of instances of a value in a column across all rows?

I have a table that looks like this:

> dt
                      variant_id          transcript_id is_NL individual counts 
        1: chr1_10007059_A_G_b38 chr1_10033695_10072027     0 GTEX-111FC     44     
        2: chr1_10007059_A_G_b38 chr1_10033695_10072027     0 GTEX-111VG     32     
        3: chr1_10007059_A_G_b38 chr1_10033695_10072027     0 GTEX-1122O     34     
        4: chr1_10007059_A_G_b38 chr1_10033695_10072027     0 GTEX-113IC     17     
        5: chr1_10007059_A_G_b38 chr1_10033695_10072027     0 GTEX-113JC     37     
       ---                                                                           
159899076: chr9_92815645_A_C_b38 chr9_92535983_92536600     1  GTEX-X4XX      1     
159899077: chr9_92815645_A_C_b38 chr9_92535983_92536600     1  GTEX-XBEW      3     
159899078: chr9_92815645_A_C_b38 chr9_92535983_92536600     1  GTEX-Y5V6      0     
159899079: chr9_92815645_A_C_b38 chr9_92535983_92536600     1  GTEX-YEC4      0     
159899080: chr9_92815645_A_C_b38 chr9_92535983_92536600     1  GTEX-ZYY3      0     

There are ~600 unique individuals, and each variant_id and transcript_id pair is present in each individual (at least in the table, counts might be 0).

What I want to do is for each variant_id and transcript_id pair, find the number of instances where counts == 0 and counts > 0 for each is_NL == 0, 1, 2.

So below is a mock table (0 == HH, 1 == HN, 2 == NN)

variant_id     transcript_id     HH=0    HH>0    HN=0    HN>0    NN=0    NN>0
a              b                 2       1146    3571    3312    741     280
...

etc. I hope that makes sense.

To further describe, each row represents the count of the transcript detected in each individual given the splice-altering variant in variant_id. is_NL represents the genotype of the individual, where 0 means homozygous for the human allele and 2 means homozygous for the Neandertal allele, and 1 means heterozygous.

structure(list(variant_id = c("chr1_10007059_A_G_b38", "chr1_10007059_A_G_b38", 
"chr1_10007059_A_G_b38", "chr1_10007059_A_G_b38", "chr1_10007059_A_G_b38", 
"chr1_10007059_A_G_b38"), transcript_id = c("chr1_10033695_10072027", 
"chr1_10033695_10072027", "chr1_10033695_10072027", "chr1_10033695_10072027", 
"chr1_10033695_10072027", "chr1_10033695_10072027"), is_NL = c(0L, 
0L, 0L, 0L, 0L, 0L), individual = c("GTEX-111FC", "GTEX-111VG", 
"GTEX-1122O", "GTEX-113IC", "GTEX-113JC", "GTEX-117XS"), counts = c(44L, 
32L, 34L, 17L, 37L, 32L), nrows = c(1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA, 
6L), class = "data.frame")

Upvotes: 0

Views: 52

Answers (1)

Richard J. Acton
Richard J. Acton

Reputation: 915

Would something like this work:

library(dplyr)
library(tidyr)
dt %>% 
  group_by(variant_id,transcript_id,is_NL) %>% 
  mutate(grp = case_when(counts == 0 ~ "0", counts > 0 ~ ">0")) %>% 
  count(grp) %>%
  ungroup() %>%
  mutate(
    is_NL = case_when(
      is_NL == 0 ~ "HH", # in the case when is_NL is 0 return HH etc.
      is_NL == 1 ~ "HN",
      is_NL == 2 ~ "NN",
    )
  ) %>% 
  # unite pastes the contents of the columns `is_NL` and `grp` together into the column `comb` separated by the value in `sep
  unite(comb, is_NL, grp, sep="") %>% 
  pivot_wider(names_from = comb, values_from = n)

The tildes are just a peculiarity of the syntax for case_when which is an awesome function for extended ifelse like cases where you need more conditions than just a single boolean, it's well worth checking out the documentation: https://dplyr.tidyverse.org/reference/case_when.html . You can always check out what each step does by just commenting out a given pipe (%>%) and everything after it.

Upvotes: 1

Related Questions