Reputation: 1088
I have a table that looks like this:
> dt
variant_id transcript_id is_NL individual counts
1: chr1_10007059_A_G_b38 chr1_10033695_10072027 0 GTEX-111FC 44
2: chr1_10007059_A_G_b38 chr1_10033695_10072027 0 GTEX-111VG 32
3: chr1_10007059_A_G_b38 chr1_10033695_10072027 0 GTEX-1122O 34
4: chr1_10007059_A_G_b38 chr1_10033695_10072027 0 GTEX-113IC 17
5: chr1_10007059_A_G_b38 chr1_10033695_10072027 0 GTEX-113JC 37
---
159899076: chr9_92815645_A_C_b38 chr9_92535983_92536600 1 GTEX-X4XX 1
159899077: chr9_92815645_A_C_b38 chr9_92535983_92536600 1 GTEX-XBEW 3
159899078: chr9_92815645_A_C_b38 chr9_92535983_92536600 1 GTEX-Y5V6 0
159899079: chr9_92815645_A_C_b38 chr9_92535983_92536600 1 GTEX-YEC4 0
159899080: chr9_92815645_A_C_b38 chr9_92535983_92536600 1 GTEX-ZYY3 0
There are ~600 unique individual
s, and each variant_id
and transcript_id
pair is present in each individual (at least in the table, counts might be 0).
What I want to do is for each variant_id
and transcript_id
pair, find the number of instances where counts == 0
and counts > 0
for each is_NL == 0, 1, 2
.
So below is a mock table (0 == HH
, 1 == HN
, 2 == NN
)
variant_id transcript_id HH=0 HH>0 HN=0 HN>0 NN=0 NN>0
a b 2 1146 3571 3312 741 280
...
etc. I hope that makes sense.
To further describe, each row represents the count
of the transcript
detected in each individual
given the splice-altering variant in variant_id
. is_NL
represents the genotype of the individual, where 0 means homozygous for the human allele and 2 means homozygous for the Neandertal allele, and 1 means heterozygous.
structure(list(variant_id = c("chr1_10007059_A_G_b38", "chr1_10007059_A_G_b38",
"chr1_10007059_A_G_b38", "chr1_10007059_A_G_b38", "chr1_10007059_A_G_b38",
"chr1_10007059_A_G_b38"), transcript_id = c("chr1_10033695_10072027",
"chr1_10033695_10072027", "chr1_10033695_10072027", "chr1_10033695_10072027",
"chr1_10033695_10072027", "chr1_10033695_10072027"), is_NL = c(0L,
0L, 0L, 0L, 0L, 0L), individual = c("GTEX-111FC", "GTEX-111VG",
"GTEX-1122O", "GTEX-113IC", "GTEX-113JC", "GTEX-117XS"), counts = c(44L,
32L, 34L, 17L, 37L, 32L), nrows = c(1L, 1L, 1L, 1L, 1L, 1L)), row.names = c(NA,
6L), class = "data.frame")
Upvotes: 0
Views: 52
Reputation: 915
Would something like this work:
library(dplyr)
library(tidyr)
dt %>%
group_by(variant_id,transcript_id,is_NL) %>%
mutate(grp = case_when(counts == 0 ~ "0", counts > 0 ~ ">0")) %>%
count(grp) %>%
ungroup() %>%
mutate(
is_NL = case_when(
is_NL == 0 ~ "HH", # in the case when is_NL is 0 return HH etc.
is_NL == 1 ~ "HN",
is_NL == 2 ~ "NN",
)
) %>%
# unite pastes the contents of the columns `is_NL` and `grp` together into the column `comb` separated by the value in `sep
unite(comb, is_NL, grp, sep="") %>%
pivot_wider(names_from = comb, values_from = n)
The tildes are just a peculiarity of the syntax for case_when
which is an awesome function for extended ifelse like cases where you need more conditions than just a single boolean, it's well worth checking out the documentation: https://dplyr.tidyverse.org/reference/case_when.html . You can always check out what each step does by just commenting out a given pipe (%>%
) and everything after it.
Upvotes: 1