Reputation: 436

dplyr calculations involving two columns of a data frame (R)

I'm pretty new to R and couldn't find a clear answer to my question after extensively searching the web. I'm trying to get dplyr functions to do the following task:

I have the following data.frame as a tibble. Columns starting with X. are different samples, and each row shows how strongly a specific gene is expressed.

 head(immgen_dat)
# A tibble: 6 x 212
  ProbeSetID GeneName  Description         X.proB_CLP_BM. X.proB_CLP_FL. X.proB_FrA_BM. X.proB_FrA_FL. X.proB_FrBC_BM.
       <int> <fct>     <fct>                        <dbl>          <dbl>          <dbl>          <dbl>           <dbl>
1   10344620 " Gm1056~ " predicted gene 1~           15.6           15.3           17.2           16.1            18.1
2   10344622 " Gm1056~ " predicted gene 1~          240.           255.           224.           312.            272. 
3   10344624 " Lypla1" " lysophospholipas~          421.           474.           349.           478.            459. 
4   10344633 " Tcea1"  " transcription el~          802.           950.           864.           968.           1056. 
5   10344637 " Atp6v1~ " ATPase H+ transp~          199.           262.           167.           267.            255. 
6   10344653 " Oprk1"  " opioid receptor ~           14.8           12.8           18.0           13.2            15.3
# ... with 204 more variables: X.proB_FrBC_FL. <dbl>,

I added a mean expression variable at the end for each gene by using the following code (the range of variables runs from the first to the last sample):

immgen_avg <- immgen_dat %>%
                 rowwise() %>% 
                   mutate(Average = mean(X.proB_CLP_BM.:X.MLP_FL.))

Here, I have a quick question: The returned mean value I get from this code doesn't match the average I calculated elsewhere (in Excel). I don't think there are any missing values.

What I'd like to do is the following: for each gene, I'd like to compare the sample values with the average value and calculate a log2-fold difference (the log2 difference between a gene's expression in one sample and its average expression across all samples). I'd like to store this data frame under the name immgen_log2 and do some subsequent analyses. In this new data frame, I'd like to keep the gene names because I'm planning to merge it with another data table to compare log2 changes between different experiments.

What is the best way of doing this? I appreciate your answers.

Upvotes: 0

Answers (2)

Tino

Reputation: 2101

I'm not entirely sure I understand what you need, but whenever you use dplyr or the tidyverse in general (including ggplot2), a long representation of your data works best. I assume that you want to calculate the mean of all variables starting with X. for each ProbeSetID and then, for each X. column and ProbeSetID, calculate the ratio to that mean and take log2, i.e. log2(X.bla/mean):

df <- read.table(text = 'ProbeSetID  X.proB_CLP_BM. X.proB_CLP_FL. X.proB_FrA_BM. X.proB_FrA_FL. X.proB_FrBC_BM.
           10344620        15.6           15.3           17.2           16.1            18.1
           10344622        240.           255.           224.           312.            272. 
           10344624        421.           474.           349.           478.            459. 
           10344633      802.           950.           864.           968.           1056. 
           10344637      199.           262.           167.           267.            255. 
           10344653      14.8           12.8           18.0           13.2            15.3', header = TRUE)

library(dplyr)
library(tidyr)

result <- 
  df %>% 
  # transform to long:
  gather(key = key, value = value, grep(x = names(.), pattern = "^X\\.")) %>% 
  # group by IDs, ie make rowwise calculations if it was still wide, but faster:
  group_by(ProbeSetID) %>% 
  # calculate group-mean on the fly and calculate log-ratio directly:
  mutate(log2_ratio = log2(value / mean(value)))

# transform back to wide, if needed:
result %>% 
  # remove initial values to have only 1 value variable:
  select(-value) %>% 
  # go back to wide:
  spread(key = key, value = log2_ratio)


# or, if you want to keep all values:
df %>% 
  # transform to long:
  gather(key = key, value = value, grep(x = names(.), pattern = "^X\\.")) %>% 
  # group by IDs, ie make rowwise calculations if it was still wide, but faster:
  group_by(ProbeSetID) %>% 
  # calculate the mean of each observation:
  mutate(mean_value = mean(value)) %>% 
  # go back to wide:
  spread(key, value) %>% 
  # now do the transformation to each variable that begins with X.
  # (funs() is deprecated since dplyr 0.8.0; a named list of formulas does the same):
  mutate_at(.vars = vars(matches("^X\\.")), 
            .funs = list(log2_ratio = ~ log2(. / mean_value)))
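
In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). Below is a minimal sketch of the first pipeline in that style. It assumes a small toy data frame (two genes and three samples taken from the question) with a GeneName column added, to show that identifier columns survive the round trip. The column name "sample" is my own choice for illustration.

```r
library(dplyr)
library(tidyr)

# toy stand-in for the real data: two genes, three samples
df <- data.frame(
  ProbeSetID     = c(10344624, 10344633),
  GeneName       = c("Lypla1", "Tcea1"),
  X.proB_CLP_BM. = c(421, 802),
  X.proB_CLP_FL. = c(474, 950),
  X.proB_FrA_BM. = c(349, 864)
)

immgen_log2 <- df %>%
  # long format: one row per gene and sample
  pivot_longer(cols = starts_with("X."), names_to = "sample", values_to = "value") %>%
  # per-gene mean and log2 ratio in one step
  group_by(ProbeSetID) %>%
  mutate(log2_ratio = log2(value / mean(value))) %>%
  ungroup() %>%
  # keep a single value column, then go back to wide
  select(-value) %>%
  pivot_wider(names_from = sample, values_from = log2_ratio)
```

Because GeneName is not part of the pivoted columns, it stays in the result and can serve as a key when merging with another table.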

Upvotes: 0

hpesoj626

Reputation: 3619

I will explain what is happening in a moment, but one way to compute the row means of your intended variables is:

immgen_dat %>%
  mutate(Average = apply(.[, 4:8], 1, mean)) %>%
  select(Average)

#   Average
# 1   16.46
# 2  260.60
# 3  436.20
# 4  928.00
# 5  230.00
# 6   14.82

To see what is happening with your code, we can use the do function as follows:

df2 <- immgen_dat %>%
  rowwise() %>%
  do(Average = .$X.proB_CLP_BM.:.$X.proB_FrBC_BM.) 
df2$Average[1]

# [[1]]
# [1] 15.6 16.6 17.6

You will see that : generates a sequence starting at 15.6 in steps of 1, stopping before it would exceed 18.1. This is explained in more detail under help(":"). So in

immgen_dat %>%
  rowwise() %>%
  mutate(Average = mean(X.proB_CLP_BM.:X.proB_FrBC_BM.))

you are computing the means of the values of these sequences.
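
The effect is easy to reproduce outside the pipeline. The sketch below uses the first row of the example data. The second half assumes dplyr 1.0.0 or later, where c_across() collects the values of a column range within one row, so the original rowwise() approach can be kept.

```r
library(dplyr)

# `:` builds a numeric sequence from the first value towards the second
15.6:18.1
mean(15.6:18.1)   # mean of the sequence 15.6, 16.6, 17.6, not of the samples

# first row of the example data
row1 <- data.frame(X.proB_CLP_BM. = 15.6, X.proB_CLP_FL. = 15.3,
                   X.proB_FrA_BM. = 17.2, X.proB_FrA_FL. = 16.1,
                   X.proB_FrBC_BM. = 18.1)

fixed <- row1 %>%
  rowwise() %>%
  # c_across() gathers the five sample values of the current row into one vector
  mutate(Average = mean(c_across(X.proB_CLP_BM.:X.proB_FrBC_BM.))) %>%
  ungroup()

fixed$Average     # 16.46, matching the apply() result above
```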

Edit

The logarithm of a ratio is the difference of the logarithms (provided both values are positive). So if you want the difference between the log2 of each sample column and the log2 of the Average, you can do something like this:

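# assumes immgen_dat already holds the corrected Average in column 9, next to the samples in columns 4:8
immgen_log2 <- immgen_dat
# log2 (not the natural log) of the samples and of Average
immgen_log2[, 4:9] <- log2(immgen_dat[, 4:9])
# subtract log2(Average) from each sample column
immgen_log2[, 4:8] <- lapply(immgen_log2[, 4:8], function(x) x - immgen_log2$Average)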
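
The same calculation can also be written with name-based column selection, which avoids hard-coding positions such as 4:8. This is only a sketch: it assumes dplyr 1.0.0 or later for across(), the toy data frame stands in for the real immgen_dat, and sample_cols is a helper name introduced here.

```r
library(dplyr)

# toy stand-in for immgen_dat: two genes, two samples
immgen_dat <- data.frame(
  ProbeSetID     = c(10344624, 10344633),
  GeneName       = c("Lypla1", "Tcea1"),
  X.proB_CLP_BM. = c(421, 802),
  X.proB_CLP_FL. = c(474, 950)
)

# every column whose name starts with "X." is treated as a sample
sample_cols <- grep("^X\\.", names(immgen_dat), value = TRUE)

immgen_log2 <- immgen_dat %>%
  # row-wise mean across the sample columns
  mutate(Average = rowMeans(immgen_dat[sample_cols])) %>%
  # log2(sample) - log2(Average), i.e. log2(sample / Average)
  mutate(across(all_of(sample_cols), ~ log2(.x) - log2(Average)))
```

ProbeSetID and GeneName are left untouched, so the result can later be joined to another table on either column.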

Upvotes: 1
