frodo

Reputation: 1571

Wordcount in a large file using Spark

I have a question about how I can work on large files using Spark. Let's say I have a really large file (1 TB), while I only have access to 500 GB of RAM in my cluster. A simple wordcount application would look like the following:

sc.textFile(path_to_file).flatMap(split_line_to_words).map(lambda x: (x, 1)).reduceByKey(lambda a, b: a + b)

If I do not have access to enough memory, will the above application fail with an OOM error? If so, what are some ways I can fix this?

Upvotes: 0

Views: 297

Answers (1)

Ged

Reputation: 18013

Well, this is not an issue.

N partitions, each corresponding to a block of the HDFS(-like) file system, will be physically created on the worker nodes at some stage, resulting in N small tasks to execute over the life of the Spark app, each easily fitting inside the 500 GB.
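As a minimal sketch (assuming an existing SparkContext called sc, an illustrative HDFS path, and the default 128 MB block size), you can see how many partitions, and therefore tasks, the input produces:

# Minimal sketch: inspect how many partitions (and therefore tasks) the input produces.
# Assumes an existing SparkContext `sc`; the path and block size are illustrative.
rdd = sc.textFile("hdfs:///path/to/large_file")
print(rdd.getNumPartitions())  # roughly 1 TB / 128 MB per block, i.e. about 8192 partitions

Each task only needs to process one partition at a time, not the whole 1 TB.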

Partitions and their task equivalents run concurrently, based on how many executors you have allocated. If you have, say, M executors with 1 core each, then at most M tasks run concurrently. It also depends on the scheduling and resource allocation mode.
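For illustration only (the property names below are standard Spark settings, but the values and app name are made up), the degree of concurrency is driven by the executor resources you request:

from pyspark import SparkConf, SparkContext

# Hypothetical allocation: M = 8 executors with 1 core each => at most 8 tasks at a time.
conf = (SparkConf()
        .setAppName("wordcount")
        .set("spark.executor.instances", "8")  # M executors (YARN/Kubernetes style setting)
        .set("spark.executor.cores", "1")      # 1 core per executor
        .set("spark.executor.memory", "8g"))   # per-executor memory, well under the cluster total
sc = SparkContext(conf=conf)

With these made-up numbers, Spark works through the ~8192 tasks 8 at a time; partitions that have not yet been processed simply remain in the file system until their tasks run.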

Spark, like any OS as it were, handles situations where the data is larger than the available resources; depending on the resources available, more or less can be done in parallel. The DAG Scheduler plays a role in all of this, but I am keeping it simple here.

Upvotes: 1
