Reputation: 9622
I have a gzip-compressed file, file.gz, with 4,726,276 lines, where the first and last five lines look like this:
FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
CAN -1 CAN 1 OT 0 1.0000 0.0000 0.0000 0.0000 -1 0.745118 0.1111 1.5526
CAN -1 CAN 2 OT 0 0.8761 0.1239 0.0000 0.0619 -1 0.752607 0.0648 1.4615
CAN -1 CAN 3 OT 0 0.8810 0.1190 0.0000 0.0595 -1 0.753934 0.3058 1.7941
CAN -1 CAN 4 OT 0 0.8911 0.1089 0.0000 0.0545 -1 0.751706 0.8031 2.4138
WAN 2 WAN 4 OT 0 0.8410 0.0000 0.1590 0.1590 -1 0.787251 0.0840 1.5000
WAN 2 WAN 5 OT 0 0.8606 0.0000 0.1394 0.1394 -1 0.784882 0.7671 2.3571
WAN 3 WAN 4 OT 0 0.8306 0.0000 0.1694 0.1694 -1 0.790142 0.0392 1.3846
WAN 3 WAN 5 OT 0 0.7960 0.0364 0.1676 0.1858 -1 0.795924 0.5000 2.0000
WAN 4 WAN 5 OT 0 0.8227 0.0090 0.1683 0.1728 -1 0.793460 0.5577 2.0645
The x-value is column 1 + 2. The y-value is column 3 + 4. The z-value is column 10. Values along the diagonal are not present in the input file. They should preferably be 1, but 0 is also fine.
How can I create a heat map from such data?
Here is a simple example for a 3x3 matrix:
FID1 IID1 FID2 IID2 PI_HAT
A 1 B 1 0.1
A 1 B 2 0.2
B 1 B 2 0.3
Upvotes: 1
Views: 977
Reputation: 23200
Your question seems to have 2 parts:
At first blush, it appeared that you were implying that the data was large. There are many resources on how to use Big Data in R (here's one). Based on the comments, however, I take it that data size is not actually an issue. If it were, your options would depend in part on your hardware resources and on your willingness to sample the data (which I highly recommend) rather than use every one of your roughly 5 million rows. The Central Limit Theorem is your friend.
You can read in gzip data like this (the default separator is any whitespace, which covers both spaces and tabs):
data <- read.table(gzfile("file.gz"), header = TRUE, stringsAsFactors = FALSE)
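If memory does become a concern, one option is to skip the columns you do not need while reading. The sketch below assumes the 14-column, whitespace-separated layout shown in the question. It writes a tiny stand-in file to a temporary path so that it runs on its own; `tmp`, `keep` and `dat` are names made up for this example.

```r
# Stand-in for file.gz: a header and two rows in the question's 14-column layout
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "w")
writeLines(c(
  "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO",
  "CAN -1 CAN 1 OT 0 1.0000 0.0000 0.0000 0.0000 -1 0.745118 0.1111 1.5526",
  "CAN -1 CAN 2 OT 0 0.8761 0.1239 0.0000 0.0619 -1 0.752607 0.0648 1.4615"
), con)
close(con)

# "NULL" in colClasses tells read.table to drop that column entirely,
# so only the four ID columns and PI_HAT (column 10) are kept
keep <- c(rep("character", 4), rep("NULL", 5), "numeric", rep("NULL", 4))
dat <- read.table(gzfile(tmp), header = TRUE, colClasses = keep)
str(dat)
```

Reading the IDs as character keeps values such as -1 as labels rather than numbers, which is what you want for axis labels.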
Since you did not provide your compressed archive, I've copied your sample data and read it from my clipboard in the code below. I'll show you how to construct a heatmap from this data; for importing from gzip and handling Big Data check out the link provided above.
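library(fields)  # provides smooth.2d() and image.plot()
# "clipboard" works on Windows; substitute a file path elsewhere
a <- read.table(file("clipboard"), header = TRUE)
# One label per sample: family ID and individual ID pasted together
id1 <- paste(a[, 1], a[, 2])
id2 <- paste(a[, 3], a[, 4])
# Encode the labels as integer axis positions, with the same coding on both axes
ids <- sort(unique(c(id1, id2)))
x <- as.numeric(factor(id1, levels = ids))
y <- as.numeric(factor(id2, levels = ids))
z <- a[, 10]  # PI_HAT
# Kernel-smooth z over the (x, y) locations, then plot the surface
s <- smooth.2d(z, x = cbind(x, y), theta = 0.5)
image.plot(s)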
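If you would rather plot the exact values than a smoothed surface, a square matrix can be built directly from the pair list. This is a minimal sketch using the 3x3 example from the question and base graphics only; the names `d`, `ids` and `m` are made up here. If every pair is listed exactly once, 4,726,276 lines correspond to a header plus 3075 × 3074 / 2 pairs, so the full matrix would be about 3075 × 3075, which fits in memory comfortably.

```r
# The 3x3 example from the question
d <- read.table(text = "FID1 IID1 FID2 IID2 PI_HAT
A 1 B 1 0.1
A 1 B 2 0.2
B 1 B 2 0.3", header = TRUE)

# One label per sample: family ID and individual ID pasted together
id1 <- paste(d$FID1, d$IID1)
id2 <- paste(d$FID2, d$IID2)
ids <- unique(c(id1, id2))

# Square matrix, filled through two-column (row name, column name) index matrices
m <- matrix(NA_real_, length(ids), length(ids), dimnames = list(ids, ids))
m[cbind(id1, id2)] <- d$PI_HAT
m[cbind(id2, id1)] <- d$PI_HAT  # mirror so the matrix is symmetric
diag(m) <- 1                    # the diagonal is absent from the input

# Base-graphics heat map with labelled axes
image(seq_along(ids), seq_along(ids), m, axes = FALSE, xlab = "", ylab = "")
axis(1, at = seq_along(ids), labels = ids)
axis(2, at = seq_along(ids), labels = ids)
```

Setting `diag(m) <- 0` instead gives the other diagonal value the question allows.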
Upvotes: 1
Reputation: 37879
This is a ggplot2 approach. The roughly 4.7 million rows shouldn't be a problem in R.
df <- read.table(text='FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
CAN -1 CAN 1 OT 0 1.0000 0.0000 0.0000 0.0000 -1 0.745118 0.1111 1.5526
CAN -1 CAN 2 OT 0 0.8761 0.1239 0.0000 0.0619 -1 0.752607 0.0648 1.4615
CAN -1 CAN 3 OT 0 0.8810 0.1190 0.0000 0.0595 -1 0.753934 0.3058 1.7941
CAN -1 CAN 4 OT 0 0.8911 0.1089 0.0000 0.0545 -1 0.751706 0.8031 2.4138
CAN -1 CAN 4 OT 0 0.8911 0.1089 0.0000 0.0545 -1 0.751706 0.8031 2.4138
CAN -1 CAN 4 OT 0 0.8911 0.1089 0.0000 0.0545 -1 0.751706 0.8031 2.4138
WAN 3 WAN 4 OT 0 0.8306 0.0000 0.1694 0.1694 -1 0.790142 0.0392 1.3846
WAN 3 WAN 4 OT 0 0.8306 0.0000 0.1694 0.1694 -1 0.790142 0.0392 1.3846
WAN 2 WAN 4 OT 0 0.8410 0.0000 0.1590 0.1590 -1 0.787251 0.0840 1.5000
WAN 2 WAN 5 OT 0 0.8606 0.0000 0.1394 0.1394 -1 0.784882 0.7671 2.3571
WAN 3 WAN 4 OT 0 0.8306 0.0000 0.1694 0.1694 -1 0.790142 0.0392 1.3846
WAN 3 WAN 5 OT 0 0.7960 0.0364 0.1676 0.1858 -1 0.795924 0.5000 2.0000
WAN 4 WAN 5 OT 0 0.8227 0.0090 0.1683 0.1728 -1 0.793460 0.5577 2.0645', header=T)
I duplicated a few rows of your sample data so that the heat map has something to show; previously no pair appeared more than once:
# create the axis labels by pasting columns 1+2 and 3+4 (paste is vectorized)
a <- paste(df[[1]], df[[2]])
b <- paste(df[[3]], df[[4]])
# combine in a data.frame
df2 <- data.frame(a, b)
library(dplyr)
# count the rows for each (a, b) pair, because geom_tile needs one row per tile
# this should only take a few seconds for a few million rows
df3 <- df2 %>%
  group_by(a, b) %>%
  summarize(total = n())
# plot with ggplot2
library(ggplot2)
ggplot(df3, aes(x = a, y = b, fill = total)) + geom_tile()
Output: (heat map image, one tile per pair, filled by row count)
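The code above fills each tile with the number of rows per pair. If the fill should instead be PI_HAT (column 10), as the question describes, something along these lines may work. It is a sketch on the 3x3 example from the question, assuming each pair occurs once; `d`, `ids` and `long` are names made up for this example.

```r
library(ggplot2)

# The 3x3 example from the question
d <- read.table(text = "FID1 IID1 FID2 IID2 PI_HAT
A 1 B 1 0.1
A 1 B 2 0.2
B 1 B 2 0.3", header = TRUE)

# One label per sample: family ID and individual ID pasted together
d$a <- paste(d$FID1, d$IID1)
d$b <- paste(d$FID2, d$IID2)
ids <- unique(c(d$a, d$b))

# Mirror each pair and add the diagonal so the plot is a full square
long <- rbind(
  data.frame(a = d$a, b = d$b, PI_HAT = d$PI_HAT),
  data.frame(a = d$b, b = d$a, PI_HAT = d$PI_HAT),
  data.frame(a = ids, b = ids, PI_HAT = 1)
)

# One tile per (a, b) combination, coloured by PI_HAT
p <- ggplot(long, aes(x = a, y = b, fill = PI_HAT)) + geom_tile()
print(p)
```

With a few thousand samples the axis labels will overlap, so you would probably want to hide them, for example with `theme(axis.text = element_blank())`.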
Upvotes: 2