user125687

Reputation: 85

Loading a very big CSV file using Apache Spark

I need to load huge CSV files using Apache Spark.

So far, I have loaded various files using Apache Spark's read method without any problems. However, those files were not big; they were around 100 megabytes.

Now I am getting scalability questions like: "What happens if the file does not fit into the driver's memory?"

How does the spark.read method work? Does it load the CSV file into the driver's (master node) memory? I would appreciate any ideas, experience, or documentation.

sample code:

df = spark.read.format("csv").option("header","true").load("hugecsvfile.csv")

Upvotes: 0

Views: 3043

Answers (2)

bhavin

Reputation: 120

From the code sample you posted, it seems hugecsvfile.csv is already on the master node, but on disk.

So Spark will read your file and send the data to the core (worker) nodes in the cluster. Spark automatically spills data to disk on those core nodes if required. You can explicitly tell it to persist the computation on disk, but if you don't, it will be recomputed from the file.
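For example, a minimal sketch (assuming PySpark; the session setup is just for illustration) of explicitly persisting the DataFrame to disk so later actions reuse the spilled data instead of re-parsing the CSV:

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

df = spark.read.format("csv").option("header", "true").load("hugecsvfile.csv")

# Persist to disk only, so repeated actions reuse the materialized data
# on the worker nodes instead of re-reading the CSV from scratch.
df.persist(StorageLevel.DISK_ONLY)

df.count()  # first action reads the file and writes the persisted copy
df.count()  # second action is served from the persisted copy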

Spark only brings data into the master node's memory (it does not spill to disk on the master node) when you execute an action that returns results to the driver.
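To illustrate the difference (a rough sketch, assuming PySpark; the output path is made up), an action like collect() pulls every row back to the driver, while a distributed write never does:

# collect() brings every row into the driver's memory -- dangerous for a huge file
rows = df.collect()

# Writing out stays distributed across the worker nodes; nothing large
# is gathered on the master node.
df.write.mode("overwrite").parquet("hugecsvfile_parquet")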

Upvotes: 1

Steven

Reputation: 15258

This code does not load the file into memory. It will read the file once to determine the schema, but that's all. It is better to provide the schema explicitly; otherwise the schema-inference pass alone can take a long time on a huge file. At the very least, you could set an option (e.g. samplingRatio) so that only a portion of your file is read for inference.
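For instance, a sketch of that (assuming PySpark; the column names below are hypothetical, not from the question):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical schema -- replace with the real columns of hugecsvfile.csv.
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

# With an explicit schema, Spark does not scan the file to infer types.
df = (spark.read.format("csv")
      .option("header", "true")
      .schema(schema)
      .load("hugecsvfile.csv"))

# Or keep inference but sample only a fraction of the rows for it.
df_inferred = (spark.read.format("csv")
               .option("header", "true")
               .option("inferSchema", "true")
               .option("samplingRatio", 0.1)
               .load("hugecsvfile.csv"))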

After that, any transformation/action will be executed on chunks (partitions) of your file.
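As a small illustration (a sketch, assuming PySpark and the hypothetical id column from the snippet above), a filter followed by a write runs partition by partition, so the full file never has to fit on one machine:

# Lazy transformation: nothing is read yet.
filtered = df.filter(df["id"] > 100)

# The write action processes the file chunk by chunk on the workers.
filtered.write.mode("overwrite").csv("filtered_output")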

Upvotes: 1
