How to split the input file in Apache Spark

Question

Suppose I have an input file of size 100 MB. It contains a large number of points (lat-long pairs) in CSV format. How can I split the input file into ten 10 MB files in Apache Spark, or otherwise customize the split?

Note: I want to process a subset of the points in each mapper.

suztomo · Accepted Answer

Spark's abstraction doesn't provide an explicit way to split the data. However, you can control the parallelism in several ways.

Assuming you use YARN, an HDFS file is automatically split into HDFS blocks, and the blocks are processed concurrently when a Spark action runs.
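If the block-based split is too coarse, the number of partitions can be influenced when reading the file or changed afterwards. The following is a minimal PySpark sketch, not a definitive recipe: the helper names (parse_point, summarize_partition, main), the app name, and the target of 10 partitions are assumptions made for illustration. It uses textFile with minPartitions (a hint, not an exact count), repartition to force 10 partitions, and mapPartitions to process the subset of points in each partition.

```python
import sys


def parse_point(line):
    """Parse one 'lat,long' CSV line into a (lat, long) tuple of floats."""
    lat, lon = line.split(",")[:2]
    return float(lat), float(lon)


def summarize_partition(lines):
    """Runs once per partition: parse its points and yield the point count."""
    count = 0
    for line in lines:
        if line.strip():
            parse_point(line)
            count += 1
    yield count


def main(path, num_partitions=10):
    # pyspark is imported here so the helpers above stay usable without Spark.
    from pyspark import SparkContext

    sc = SparkContext(appName="split-points")  # hypothetical app name
    # minPartitions is a suggested minimum, not an exact partition count.
    rdd = sc.textFile(path, minPartitions=num_partitions)
    print("partitions after textFile:", rdd.getNumPartitions())
    # repartition() shuffles the data into exactly num_partitions partitions.
    rdd = rdd.repartition(num_partitions)
    # Each call to summarize_partition sees only the lines of one partition.
    print("points per partition:", rdd.mapPartitions(summarize_partition).collect())
    sc.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Partitions produced this way are roughly, not exactly, equal in size, so a 100 MB input will not necessarily yield ten pieces of exactly 10 MB.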

Apart from HDFS parallelism, consider using a partitioner with a PairRDD. A PairRDD is an RDD of key-value pairs, and a partitioner manages the mapping from a key to a partition. The default partitioner reads spark.default.parallelism. The partitioner helps control the distribution of the data as well as its locality in PairRDD-specific operations such as reduceByKey.
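To make the key-to-partition mapping concrete, here is a small pure-Python model of what a partitioner does. It is only an illustration under stated assumptions: keying each point by its one-degree grid cell, the cell_hash function, and the choice of 10 partitions are all hypothetical, and Spark's own default hashing differs in detail.

```python
from collections import defaultdict


def grid_cell(lat, lon, cell_deg=1.0):
    """Hypothetical key: the integer grid cell that contains the point."""
    return (int(lat // cell_deg), int(lon // cell_deg))


def cell_hash(key):
    """Hypothetical hash: turn a (row, col) grid cell into a single integer."""
    row, col = key
    return row * 31 + col


def assign_partitions(points, num_partitions=10):
    """Group points by partition index, the way a hash partitioner would."""
    parts = defaultdict(list)
    for lat, lon in points:
        key = grid_cell(lat, lon)
        # Python's % returns a non-negative result for a positive modulus.
        parts[cell_hash(key) % num_partitions].append((lat, lon))
    return dict(parts)


if __name__ == "__main__":
    sample = [(10.2, 20.7), (10.9, 20.1), (-0.5, 3.2)]
    for index, members in sorted(assign_partitions(sample).items()):
        print(index, members)
```

Points with the same key always land in the same partition, which is what lets key-based operations such as reduceByKey work on co-located data. In PySpark, a comparable function could be passed as the partitionFunc argument of partitionBy on an RDD of (key, value) pairs, for example pairs.partitionBy(10, cell_hash).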

Take a look at the following documentation on Spark data parallelism:

http://spark.apache.org/docs/1.2.0/tuning.html
