Reputation: 663
I have a really big dataset, more than 3 GB in size (only X and Y variables). The data frame IBD
has 300 million rows. Is there a better and faster way to plot this?
First, I read the data frame:
library(data.table)
IBD <- fread("/40/AD/LL_Cohorts_MERGED-IBD.genome", select = c("X", "Y"))
and tried to plot, but it's been over 12 hours and I am not getting any output.
ggplot(IBD, aes(x = X, y = Y)) +
  geom_point() +
  ggtitle("ADGC EOAD") +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1))
Upvotes: 0
Views: 475
Reputation: 2232
One way to get the run time down is to experiment with smaller datasets and different code to see which approach is faster.
You can use system.time()
to see how long something takes and compare:
Measuring function execution time in R
For example:
size <- 100000
IBD <- data.frame(X = rbeta(n = size, shape1 = 2, shape2 = 2),
                  Y = rbeta(n = size, shape1 = 2, shape2 = 2))
Using your code on this fake dataset:
system.time(
  ggplot(IBD, aes(x = X, y = Y)) +
    geom_point() +
    ggtitle("ADGC EOAD") +
    scale_x_continuous(limits = c(0, 1)) +
    scale_y_continuous(limits = c(0, 1))
)
user system elapsed
0.01 0.00 0.01
Using base plot
as a comparison point:
system.time(
  plot(Y ~ X, data = IBD)
)
user system elapsed
2.13 2.34 4.56
You can see that plot
takes a lot longer here. (One caveat: ggplot() only builds the plot object lazily, and the points aren't actually drawn until the plot is printed, so wrapping the ggplot call in print() gives a fairer comparison of rendering time.) I realize this isn't a solution to making your code faster, but it is a tool that you can use to figure out what would be faster on such a large dataset.
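Another option worth timing (a sketch of my own, not from the original question or comments): plot a random sample of rows instead of every point. With hundreds of millions of points, a scatter plot is almost entirely overplotted anyway, so a modest sample usually looks the same. The sample size of 10,000 here is an arbitrary choice for illustration:

```r
library(ggplot2)

# Fake data standing in for the real IBD data frame
size <- 100000
IBD <- data.frame(X = rbeta(size, 2, 2), Y = rbeta(size, 2, 2))

# Draw a reproducible random sample of row indices
set.seed(42)
idx <- sample(nrow(IBD), 10000)

# Plot only the sampled rows; cost no longer depends on total row count
p <- ggplot(IBD[idx, ], aes(x = X, y = Y)) +
  geom_point() +
  ggtitle("ADGC EOAD") +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1))
```

On the real 300-million-row data frame, sampling with data.table (e.g. IBD[sample(.N, 10000)]) avoids copying the full object.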
Adding in the methods from comments by @maydin:
cluster <- kmeans(x = IBD, centers = 1000)
Clus <- data.frame(cluster$centers)
system.time(
  ggplot(Clus, aes(x = X, y = Y)) +
    geom_point() +
    ggtitle("ADGC EOAD") +
    scale_x_continuous(limits = c(0, 1)) +
    scale_y_continuous(limits = c(0, 1))
)
user system elapsed
0 0 0
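The clustered plot is fast because only the 1,000 cluster centers are drawn (though kmeans itself can be slow on 300 million rows). A related idea, sketched here as an assumption rather than something from the comments, is 2-D binning with geom_bin2d(), which ships with ggplot2: it summarises the points into a grid of counts, so rendering cost depends on the number of bins, not the number of rows, and the counts show density that a plain scatter plot hides:

```r
library(ggplot2)

# Fake data standing in for the real IBD data frame
size <- 100000
IBD <- data.frame(X = rbeta(size, 2, 2), Y = rbeta(size, 2, 2))

# geom_bin2d() aggregates points into a 100x100 grid of tiles,
# coloured by how many points fall in each tile
p <- ggplot(IBD, aes(x = X, y = Y)) +
  geom_bin2d(bins = 100) +
  ggtitle("ADGC EOAD") +
  scale_x_continuous(limits = c(0, 1)) +
  scale_y_continuous(limits = c(0, 1))
```

geom_hex() from the hexbin package works the same way if you prefer hexagonal bins.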
Upvotes: 1