Kyle

Reputation: 155

Preparing a data frame in R for a heatmap with ggplot2

I'm trying to create a heatmap of some genetic data. The columns are currently labeled s1, s2, s3, etc., but I also have a .txt file with the correct label for each sample. I'm not sure whether I need to modify the csv file of gene expression levels first, or whether I can add the labels separately to the data frame I'm preparing for the heatmap. I'm also not sure what format the data frame should be in. I would like to use ggplot2 to create the heatmap, if that matters.

Here's my code so far:

library(ggplot2)
library(dplyr)
library(magrittr) 

nci <- read.csv('/Users/myname/Desktop/ML Extra Credit/nci.data.csv')
nci.label <- scan(url("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/nci.label"), what = "")
                 
# Simulated example matrix (10 genes x 20 samples)
mat <- matrix(rexp(200, rate=.1), ncol=20)
rownames(mat) <- paste0('gene',1:nrow(mat))
colnames(mat) <- paste0('sample',1:ncol(mat))
mat[1:5,1:5]

The last line prints the first five rows and columns of a sample matrix, which looks like this:

        sample1   sample2    sample3   sample4   sample5
gene1 32.278434 16.678512  0.4637713  1.016569  3.353944
gene2  8.719729 11.080337  1.5254223  2.392519  3.503191
gene3  2.199697 18.846487 13.6525699 34.963664  2.511097
gene4  5.860673  2.160185  3.5243884  6.785453  3.947606
gene5 16.363688 38.543575  5.6761373 10.142018 22.481752

Any help would be greatly appreciated!!

Upvotes: 0

Views: 2948

Answers (1)

chemdork123

Reputation: 13793

You will want to get your data frame into "long" format to facilitate plotting. This layout is known as tidy data, and it is the form ggplot2 expects for plotting.

The general idea here is that you need one column for the x value, one column for the y value, and one column for the value that sets the tile color. There are lots of ways to do this (see melt(), pivot_longer()...), but I like to use tidyr::gather(). Since your gene names are stored as row names rather than in a column, I first copy them into a gene column in your dataset.

library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1234)

# create matrix
mat <- matrix(rexp(200, rate=.1), ncol=20)
rownames(mat) <- paste0('gene',1:nrow(mat))
colnames(mat) <- paste0('sample',1:ncol(mat))
mat[1:5,1:5]

# convert to data.frame and gather
mat <- as.data.frame(mat)
mat$gene <- rownames(mat)
mat <- mat %>% gather(key='sample', value='value', -gene)
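
For reference, the same reshaping can be sketched with tidyr::pivot_longer(), which tidyr now recommends over gather(). This is just an equivalent sketch on the same simulated matrix; the names wide and long are my own and not part of the code above.

```r
library(tidyr)

set.seed(1234)

# Same simulated matrix as above: 10 genes x 20 samples
mat <- matrix(rexp(200, rate = 0.1), ncol = 20)
rownames(mat) <- paste0("gene", 1:nrow(mat))
colnames(mat) <- paste0("sample", 1:ncol(mat))

# Wide data frame with the gene names copied into a regular column
wide <- as.data.frame(mat)
wide$gene <- rownames(wide)

# Pivot every column except gene into sample/value pairs
long <- pivot_longer(wide, cols = -gene, names_to = "sample", values_to = "value")
```

The result has one row per gene and sample combination, with the columns gene, sample, and value, so it can be passed to ggplot() in the same way.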

The ggplot call is pretty easy. We assign each column to x, y, and fill aesthetics, then use geom_tile() to create the actual heatmap.

ggplot(mat, aes(sample, gene)) + geom_tile(aes(fill=value))
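
To put the real sample labels on the axis, one option is to rename the matrix columns before reshaping. The sketch below assumes you have a character vector with one label per column; sample_labels is a made-up stand-in for whatever you read from your label file. Labels such as cancer types can repeat, so make.unique() is used to keep the column names distinct, and the factor levels keep the axis in the original column order rather than alphabetical order.

```r
library(ggplot2)
library(tidyr)

set.seed(1234)
mat <- matrix(rexp(200, rate = 0.1), ncol = 20)
rownames(mat) <- paste0("gene", 1:nrow(mat))

# Hypothetical labels, one per column (stand-in for the real label vector)
sample_labels <- rep(c("CNS", "RENAL", "BREAST", "NSCLC"), each = 5)

# Repeated labels get a numeric suffix so that column names stay unique
colnames(mat) <- make.unique(sample_labels)

# Reshape to long format with the gene names as a column
long <- as.data.frame(mat)
long$gene <- rownames(mat)
long <- pivot_longer(long, cols = -gene, names_to = "sample", values_to = "value")

# Keep the original column and row order on the axes
long$sample <- factor(long$sample, levels = colnames(mat))
long$gene <- factor(long$gene, levels = rownames(mat))

p <- ggplot(long, aes(sample, gene, fill = value)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

Printing p draws the heatmap. With a real label vector, check first that its length matches ncol(mat).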


Upvotes: 1
