tommy.carstensen

Reputation: 9622

Creating heat map with R from a square matrix

I have a gzip compressed file file.gz with 4,726,276 lines where the first and last five lines look like this:

 FID1 IID1 FID2 IID2 RT    EZ      Z0      Z1      Z2  PI_HAT PHE       DST     PPC   RATIO
 CAN  -1 CAN   1 OT     0  1.0000  0.0000  0.0000  0.0000  -1  0.745118  0.1111  1.5526
 CAN  -1 CAN   2 OT     0  0.8761  0.1239  0.0000  0.0619  -1  0.752607  0.0648  1.4615
 CAN  -1 CAN   3 OT     0  0.8810  0.1190  0.0000  0.0595  -1  0.753934  0.3058  1.7941
 CAN  -1 CAN   4 OT     0  0.8911  0.1089  0.0000  0.0545  -1  0.751706  0.8031  2.4138

 WAN   2 WAN   4 OT     0  0.8410  0.0000  0.1590  0.1590  -1  0.787251  0.0840  1.5000
 WAN   2 WAN   5 OT     0  0.8606  0.0000  0.1394  0.1394  -1  0.784882  0.7671  2.3571
 WAN   3 WAN   4 OT     0  0.8306  0.0000  0.1694  0.1694  -1  0.790142  0.0392  1.3846
 WAN   3 WAN   5 OT     0  0.7960  0.0364  0.1676  0.1858  -1  0.795924  0.5000  2.0000
 WAN   4 WAN   5 OT     0  0.8227  0.0090  0.1683  0.1728  -1  0.793460  0.5577  2.0645

The x-value is columns 1 and 2 combined (FID1 and IID1). The y-value is columns 3 and 4 combined (FID2 and IID2). The z-value is column 10 (PI_HAT). Values along the diagonal are not present in the input file. They should preferably be 1, but 0 is also fine.

How can I create a heat map from such data?

Here is a simple example for a 3x3 matrix:

FID1 IID1 FID2 IID2 PI_HAT
A 1 B 1 0.1
A 1 B 2 0.2
B 1 B 2 0.3

Upvotes: 1

Views: 977

Answers (2)

Hack-R

Reputation: 23200

Your question seems to have 2 parts:

  1. How to handle the data in R (in this case coming from a gzip compressed archive)
  2. How to make a heatmap

At first blush, it appeared that you were implying that the data were large. There are many resources on working with Big Data in R (here's one). Based on the comments, however, I take it that data size is not actually an issue. If it were, your options would depend partly on your hardware resources and partly on your willingness to sample the data (which I highly recommend) rather than use every single one of your roughly 5 million rows. The Central Limit Theorem is your friend.

You can read in gzip data like this:

data <- read.table(gzfile("file.gz"), header=TRUE, stringsAsFactors=FALSE)
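If only the pair identifiers and PI_HAT are needed, the other columns can be dropped while reading, which may save memory on a file of this size. This is a minimal sketch, assuming the file is whitespace-delimited and has the 14-column header shown in the question. The temporary file, the two sample rows, and the names `keep` and `pair_df` are only there to make the example self-contained.

```r
# Write a tiny gzip file that mimics the layout in the question (illustrative only)
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "w")
writeLines(c(
  " FID1 IID1 FID2 IID2 RT    EZ      Z0      Z1      Z2  PI_HAT PHE       DST     PPC   RATIO",
  " CAN  -1 CAN   1 OT     0  1.0000  0.0000  0.0000  0.0000  -1  0.745118  0.1111  1.5526",
  " CAN  -1 CAN   2 OT     0  0.8761  0.1239  0.0000  0.0619  -1  0.752607  0.0648  1.4615"
), con)
close(con)

# "NULL" in colClasses skips a column; keep columns 1-4 and column 10 (PI_HAT)
keep <- c("character", "integer", "character", "integer",
          rep("NULL", 5), "numeric", rep("NULL", 4))
pair_df <- read.table(gzfile(tmp), header = TRUE, colClasses = keep)
str(pair_df)
```

With the real data, `gzfile(tmp)` would be replaced by `gzfile("file.gz")`.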

Since you did not provide your compressed archive, I've copied your sample data and read it from my clipboard in the code below. I'll show you how to construct a heatmap from this data; for importing from gzip and handling Big Data check out the link provided above.

require(stats)
require(fields)
require(akima)

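# file("clipboard") works on Windows; other platforms may need a different source
a <- read.table(file("clipboard"), header = TRUE)

# Map the family IDs (text) to integer codes; the individual IDs are already numeric
a$x1 <- as.numeric(factor(a[,1]))
a$x2 <- as.numeric(a[,2])
a$y1 <- as.numeric(factor(a[,3]))
a$y2 <- as.numeric(a[,4])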
x <- as.matrix(cbind(a$x1, a$x2))
y <- as.matrix(cbind(a$y1, a$y2))
z <- as.matrix(a[, 10])

s = smooth.2d(z, x=cbind(x,y), theta=0.5)
image.plot(s)

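For an unsmoothed square matrix of the kind the question describes, the pair list can also be pivoted directly. The following is only a sketch on the question's 3x3 toy example, not a tested solution for the full file. It assumes every pair appears once, that a sample is identified by FID and IID pasted together, and that the diagonal should be 1. The names `d`, `ids` and `m` are illustrative.

```r
# Toy pair list from the question's 3x3 example
d <- read.table(text = "FID1 IID1 FID2 IID2 PI_HAT
A 1 B 1 0.1
A 1 B 2 0.2
B 1 B 2 0.3", header = TRUE, stringsAsFactors = FALSE)

# One label per sample: family ID and individual ID pasted together
id1 <- paste(d$FID1, d$IID1)
id2 <- paste(d$FID2, d$IID2)
ids <- unique(c(id1, id2))

# Empty square matrix with sample labels on both dimensions
m <- matrix(NA_real_, nrow = length(ids), ncol = length(ids),
            dimnames = list(ids, ids))

# A two-column character matrix indexes by dimnames; fill both triangles
m[cbind(id1, id2)] <- d$PI_HAT
m[cbind(id2, id1)] <- d$PI_HAT
diag(m) <- 1  # the input has no self-pairs

# Base-graphics heat map; Rowv/Colv = NA keeps the input order (no clustering)
heatmap(m, Rowv = NA, Colv = NA, scale = "none", symm = TRUE)
```

If the full file lists every pair exactly once, its 4,726,275 data rows would correspond to 3,075 samples, so the matrix would be 3,075 x 3,075.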

Upvotes: 1

LyzandeR

Reputation: 37879

This is a ggplot2 approach. 4.5m rows shouldn't be a problem in R.

df <- read.table(text='FID1 IID1 FID2 IID2 RT    EZ      Z0      Z1      Z2  PI_HAT PHE       DST     PPC   RATIO
 CAN  -1 CAN   1 OT     0  1.0000  0.0000  0.0000  0.0000  -1  0.745118  0.1111  1.5526
 CAN  -1 CAN   2 OT     0  0.8761  0.1239  0.0000  0.0619  -1  0.752607  0.0648  1.4615
 CAN  -1 CAN   3 OT     0  0.8810  0.1190  0.0000  0.0595  -1  0.753934  0.3058  1.7941
 CAN  -1 CAN   4 OT     0  0.8911  0.1089  0.0000  0.0545  -1  0.751706  0.8031  2.4138
 CAN  -1 CAN   4 OT     0  0.8911  0.1089  0.0000  0.0545  -1  0.751706  0.8031  2.4138
 CAN  -1 CAN   4 OT     0  0.8911  0.1089  0.0000  0.0545  -1  0.751706  0.8031  2.4138
 WAN   3 WAN   4 OT     0  0.8306  0.0000  0.1694  0.1694  -1  0.790142  0.0392  1.3846
 WAN   3 WAN   4 OT     0  0.8306  0.0000  0.1694  0.1694  -1  0.790142  0.0392  1.3846
 WAN   2 WAN   4 OT     0  0.8410  0.0000  0.1590  0.1590  -1  0.787251  0.0840  1.5000
 WAN   2 WAN   5 OT     0  0.8606  0.0000  0.1394  0.1394  -1  0.784882  0.7671  2.3571
 WAN   3 WAN   4 OT     0  0.8306  0.0000  0.1694  0.1694  -1  0.790142  0.0392  1.3846
 WAN   3 WAN   5 OT     0  0.7960  0.0364  0.1676  0.1858  -1  0.795924  0.5000  2.0000
 WAN   4 WAN   5 OT     0  0.8227  0.0090  0.1683  0.1728  -1  0.793460  0.5577  2.0645', header=TRUE)

I added a few duplicate rows to your sample data to make the heatmap more meaningful. There was no overlap previously:

#create your variables by merging columns 1+2 and 3+4
#(paste is vectorized, so mapply is not needed)
a <- paste(df[[1]], df[[2]])
b <- paste(df[[3]], df[[4]])
#combine in a data.frame
df2 <- data.frame(a,b)


library(dplyr)
#aggregate because you will need aggregated rows for this to work
#this should only take a few seconds for 4.5m rows
df3 <-
df2 %>%
  group_by(a,b) %>%
  summarize(total=n())

#plot with ggplot2
library(ggplot2)
ggplot(df3, aes(x=a,y=b,fill=total)) + geom_tile()

Output: (heat map image not included)
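The plot above colours each tile by the number of rows per (a, b) pair. If the fill should instead be PI_HAT (column 10), as the question asks, a variant might look like the sketch below. It uses the question's 3x3 toy example, mirrors each pair so both triangles are drawn, and sets the diagonal to 1. The names `long`, `mirrored`, `diagonal` and `tiles` are illustrative.

```r
library(ggplot2)

# Toy pair list from the question's 3x3 example
d <- read.table(text = "FID1 IID1 FID2 IID2 PI_HAT
A 1 B 1 0.1
A 1 B 2 0.2
B 1 B 2 0.3", header = TRUE, stringsAsFactors = FALSE)

# Long format: one row per tile, labelled by FID and IID pasted together
long <- data.frame(a = paste(d$FID1, d$IID1),
                   b = paste(d$FID2, d$IID2),
                   pi_hat = d$PI_HAT,
                   stringsAsFactors = FALSE)

# Mirror each pair so the lower triangle is filled as well
mirrored <- data.frame(a = long$b, b = long$a, pi_hat = long$pi_hat,
                       stringsAsFactors = FALSE)

# The input has no self-pairs, so add the diagonal explicitly
ids <- unique(c(long$a, long$b))
diagonal <- data.frame(a = ids, b = ids, pi_hat = 1,
                       stringsAsFactors = FALSE)

tiles <- rbind(long, mirrored, diagonal)

p <- ggplot(tiles, aes(x = a, y = b, fill = pi_hat)) + geom_tile()
print(p)
```

No aggregation step is needed here, provided each pair occurs only once in the input.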

Upvotes: 2
