Yamuna_dhungana

Reputation: 663

How to create a scatter plot of really big data

I have a really big dataset, more than 3 GB in size (only X and Y variables). The data frame IBD has 300 million rows. Is there a better and faster way to plot it?

First, I read in the data frame with data.table's fread():

library(data.table)
IBD <- fread("/40/AD/LL_Cohorts_MERGED-IBD.genome", select = c("X", "Y"))

and then tried to plot it, but it has been running for over 12 hours with no output:

library(ggplot2)
ggplot(IBD, aes(x = X, y = Y)) +
  geom_point() +
  ggtitle("ADGC EOAD") +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1))

Upvotes: 0

Views: 475

Answers (1)

Dylan_Gomes

Reputation: 2232

One way to get the run time down is to experiment on smaller datasets with different code to see which approach would be faster.

Then you can use system.time() to see how long each version takes and compare:

Measuring function execution time in R

For example:

# Simulate a smaller fake dataset with the same structure: X and Y on [0, 1]
size <- 100000
IBD <- data.frame(X = rbeta(n = size, shape1 = 2, shape2 = 2),
                  Y = rbeta(n = size, shape1 = 2, shape2 = 2))

Using your code on this fake dataset:

system.time(
  ggplot(IBD, aes(x = X, y = Y)) +
    geom_point() +
    ggtitle("ADGC EOAD") +
    scale_x_continuous(limits = c(0, 1)) +
    scale_y_continuous(limits = c(0, 1))
)

   user  system elapsed 
   0.01    0.00    0.01
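Note that system.time() here only measures building the ggplot object: ggplot2 renders lazily, so nothing is actually drawn until the plot is printed. A quick sketch of timing the full render as well, by wrapping the call in print():

system.time(
  print(
    ggplot(IBD, aes(x = X, y = Y)) +
      geom_point() +
      ggtitle("ADGC EOAD") +
      scale_x_continuous(limits = c(0, 1)) +
      scale_y_continuous(limits = c(0, 1))
  )
)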

Using base plot as a comparison point:

system.time(
  plot(Y ~ X, data = IBD)
)

   user  system elapsed 
   2.13    2.34    4.56 

You can see that plot() takes a lot longer in this comparison (keeping in mind the rendering caveat above). I realize this isn't a solution that makes your code faster by itself, but it is a tool you can use to figure out what would be faster on such a large dataset.
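As an illustration of that workflow, here is a sketch (not from the original question) comparing two common strategies for huge scatter plots on the fake data: plotting only a random sample of the rows, and replacing geom_point() with 2D binning via geom_bin2d():

library(ggplot2)

# Candidate 1: draw a random sample of the rows instead of every point
idx <- sample(nrow(IBD), 10000)
system.time(
  print(ggplot(IBD[idx, ], aes(x = X, y = Y)) + geom_point())
)

# Candidate 2: bin the points into a 2D histogram so the drawing cost no
# longer scales with the number of individual points
system.time(
  print(ggplot(IBD, aes(x = X, y = Y)) + geom_bin2d(bins = 100))
)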


Edit:

Adding in the method from the comments by @maydin:

# Collapse the data to 1,000 representative cluster centers before plotting
cluster <- kmeans(x = IBD, centers = 1000)
Clus <- data.frame(cluster$centers)

system.time(
  ggplot(Clus, aes(x=X, y=Y))+ geom_point() + ggtitle("ADGC EOAD") + scale_x_continuous(limits=c(0,1)) + scale_y_continuous(limits=c(0,1))
)

   user  system elapsed 
      0       0       0
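This is essentially instant because only 1,000 cluster centers are drawn instead of every raw point. Keep in mind that kmeans() itself is not free on the full 300-million-row data, so it is worth timing that step with the same pattern (a sketch, assuming IBD is already loaded):

system.time(
  cluster <- kmeans(x = IBD, centers = 1000)
)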

Upvotes: 1
