Bob Settlage
Bob Settlage

Reputation: 25

R dplyr how to summarize

Essentially, I am trying to make a series of plots with log2 fold-change on the y-axis and mean counts on the y-axis (the observation is genes). These are commonly called MA plots. The issue I am having is getting my data into the right form. I can do this through a loop, but would like to know the right way to do it.

At this point, I have two data frames: my design matrix and my data matrix. The design matrix looks like so (call it ED_df):

SampleID     Patient Grade Batch
MD48L_2_B_L1    MD48    G2 Feb15
MD48R_3_B_L1    MD48    G3 Feb15
MD53L_2_B_L1    MD53    G2 Feb15
MD53R_3_B_L1    MD53    G3 Feb15
MD58L_2_B_L1    MD58    G2 Sep15
MD58R_3_B_L1    MD58    G3 Sep15

dim(ED_df)
# [1] 18 6

Each row is a unique sample. Each sample comes from patient+grade+batch. In this case, all patients are paired around grade (G2 or G3). There are 3 total batches. Two patients were replicated across either batch 1 and 2 or batch 2 and 3.

My data matrix looks like so (call it data_df):

      Gene          MD48L_2_B_L1 MD48R_3_B_L1 MD53L_2_B_L1 MD53R_3_B_L1 MD58L_2_B_L1
1 ENSG00000000003    364.26079    329.28730    531.52188    371.67413   275.745038
2 ENSG00000000005     18.92264     49.89201     42.18428     19.42548     1.948728
3 ENSG00000000419    270.59373    261.65590    284.74386    414.41018   293.283591
4 ENSG00000000457    145.70432    125.28439    122.33440    129.50318   148.103342

dim(data_df)
# [1] 31707 18

Each column corresponds to a unique sample.

What I am wanting to do is to get, for each gene, a log2 fold-change (G3/G2) within each patient-batch set. Additionally, I want to get mean (G3, G2) for each patient-batch set.

I will then plot this as an MA plot.

Again, I can see how to do this painfully through a nested for-loop, what I would like to do is figure out how to do this via some sort of aggregating function.

Upvotes: 0

Views: 416

Answers (1)

bramtayl
bramtayl

Reputation: 4024

Two more steps: spread grade so G2 and G3 end up in different columns, then summarize. I'm not sure if I exactly understood the aggregation process you wanted, but I've taken a stab. I included the psych package for the gm (geometric mean) function. This is important when working with ratio data.

library(dplyr)
library(tidyr)
library(psych)

data_df %>%
  as.data.frame %>%
  gather(sample, measurement, -gene) %>%
  left_join(ED_df) %>%
  spread(Grade, measurement) %>%
  group_by(Patient, Batch) %>%
  summarize(G2_geometric_mean = G2 %>% gm,
            G3_geometric_mean = G3 %>% gm) %>%
  mutate(geometric_mean_ratio = G3_geometric_mean / G2_geometric_mean)

Upvotes: 1

Related Questions