Felix

Reputation: 309

sparklyr for big csv file

I am trying to load a dataset with a million rows and 1,000 columns with sparklyr. I am running Spark on a very big cluster at work, yet the data still seems to be too big. I have tried two different approaches:

This is the dataset (train_numeric.csv): https://www.kaggle.com/c/bosch-production-line-performance/data

1) Put the .csv into HDFS and read it directly: spark_read_csv(spark_context, path)

2) Read the csv file in as a regular R data frame, then copy it to Spark: spark_frame <- copy_to(sc, r_dataframe) (both approaches are sketched below)
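Roughly, something like this, assuming an existing cluster connection; the master setting, file paths, and table names are just placeholders:

library(sparklyr)

sc <- spark_connect(master = "yarn-client")   # placeholder cluster config

# 1) read the CSV directly from HDFS into Spark
train_tbl <- spark_read_csv(sc, name = "train_numeric",
                            path = "hdfs:///data/train_numeric.csv")

# 2) read into R first, then copy the R data frame to Spark
train_df  <- read.csv("train_numeric.csv")
train_tbl <- copy_to(sc, train_df, name = "train_numeric")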

Both approaches work perfectly fine on a subset of the dataset, but fail when I try to read the entire dataset.

Is anybody aware of a method that is suitable for large datasets?

Thanks, Felix

Upvotes: 0

Views: 1864

Answers (1)

michalrudko

Reputation: 1530

The question is: do you need to read the entire data set into memory?

First of all, note that Spark evaluates transformations lazily. Setting the memory parameter of spark_read_csv to FALSE makes Spark map the file without loading a copy of it into memory. The computation only takes place once collect() is called.

spark_read_csv(sc, "flights_spark_2008", "2008.csv.bz2", memory = FALSE)

So consider cutting the data down to the rows and columns you actually need before doing any calculations and collecting the results back to R, as in the example below:

http://spark.rstudio.com/examples-caching.html#process_on_the_fly
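For instance, a minimal sketch along those lines, reusing the connection sc; the HDFS path is a placeholder, and the column names Id and Response are taken from the Bosch dataset for illustration:

library(sparklyr)
library(dplyr)

# map the file without caching it in memory
train_tbl <- spark_read_csv(sc, "train_numeric",
                            "hdfs:///data/train_numeric.csv",
                            memory = FALSE)

# transformations are lazy - nothing is computed yet
result <- train_tbl %>%
  select(Id, Response) %>%    # keep only the columns you need
  filter(Response == 1)       # keep only the rows you need

# collect() triggers the Spark job and brings the (small) result into R
local_df <- collect(result)

This way the filtering and column selection run on the cluster, and only the reduced result is pulled into the R session.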

Upvotes: 2
