Reputation: 309
I am trying to load a dataset with about a million rows and 1,000 columns with sparklyr. This is the dataset (train_numeric.csv): https://www.kaggle.com/c/bosch-production-line-performance/data
I am running Spark on a very big cluster at work, but the size of the data still seems to be too much for it. I have tried two different approaches:
1) Put the .csv into HDFS and load it with spark_read_csv(spark_context, path)
2) Read the .csv into a regular R data frame and copy it to Spark with spark_frame <- copy_to(sc, r_dataframe)
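For reference, this is roughly what I am running (the connection master and the HDFS path are placeholders):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")   # placeholder for the cluster master

# 1) read the csv straight from HDFS
train_tbl <- spark_read_csv(sc, "train_numeric",
                            path = "hdfs:///path/to/train_numeric.csv")

# 2) read into R first, then copy the data frame to Spark
r_dataframe <- read.csv("train_numeric.csv")
spark_frame <- copy_to(sc, r_dataframe, "train_numeric")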
Both ways work perfectly fine on a subset of the dataset, but fail when I try to read the entire dataset.
Is anybody aware of a method that is suitable for large datasets?
Thanks, Felix
Upvotes: 0
Views: 1864
Reputation: 1530
The question is: do you need to read the entire data set into memory?
First of all, note that Spark evaluates transformations lazily. Setting the memory parameter of spark_read_csv to FALSE makes Spark map the file without copying it into memory. The whole computation only takes place once collect() is called.
spark_read_csv(sc, "flights_spark_2008", "2008.csv.bz2", memory = FALSE)
So consider cutting down the rows and columns before doing any heavy computation and pulling the results back into R, as in this example:
http://spark.rstudio.com/examples-caching.html#process_on_the_fly
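A minimal sketch of that pattern for your data, assuming the file is already in HDFS (the master and the path are placeholders; Id and Response are columns of train_numeric.csv):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")   # placeholder cluster master

# map the file without caching it in memory
train_tbl <- spark_read_csv(sc, "train_numeric",
                            path = "hdfs:///path/to/train_numeric.csv",
                            memory = FALSE)

# filtering and column selection are pushed down to Spark;
# only the reduced result is brought back into R by collect()
failures <- train_tbl %>%
  filter(Response == 1) %>%
  select(Id, Response) %>%
  collect()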
Upvotes: 2