Reputation: 842
I am new to scala and spark. I am trying to join two RDDs coming from two different text files. In each text file there two columns separated by tab, e.g.
text1 text2
100772C111 ion 200772C222 ion
100772C111 on 200772C222 gon
100772C111 n 200772C2 n
So I want to join these two files based their second columns and get a result as below meaning that there are 2 common terms for given those two documents:
((100772C111-200772C222,2))
My computer's features:
4 X (intel(r) core(tm) i5-2430m cpu @2.40 ghz)
8 GB RAM
My script:
import org.apache.spark.{SparkConf, SparkContext}
object hw {
def main(args: Array[String]): Unit = {
System.setProperty("hadoop.home.dir", "C:\\spark-1.4.1\\winutils")
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext(conf)
val emp = sc.textFile("S:\\Staff_files\\Mehmet\\Projects\\SPARK - `scala\\wos14.txt")
.map { line => val parts = line.split("\t")((parts(5)),parts(0))}
val emp_new = sc.textFile("C:\\WHOLE_WOS_TEXT\\fwo_word.txt")
.map{ line2 => val parts = line2.split("\t")
((parts(3)),parts(1)) }
val finalemp = emp_new.distinct().join(emp.distinct())
.map{case((nk1), ((parts1),(val1))) => (parts1 + "-" + val1, 1)}
.reduceByKey((a, b) => a + b)
finalemp.foreach(println)
}
}
This code gives what I want when I try with text files in smaller sizes. However, I want to implement this script for big text files. I have one text file with a size of 110 KB (approx. 4M rows) and another one 9 gigabyte (more than 1B rows).
When I run my script employing these two text files, I observed on the log screen following:
15/09/04 18:19:06 INFO TaskSetManager: Finished task 177.0 in stage 1.0 (TID 181) in 9435 ms on localhost (178/287)
15/09/04 18:19:06 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:5972688896+33554432
15/09/04 18:19:15 INFO Executor: Finished task 178.0 in stage 1.0 (TID 182). 2293 bytes result sent to driver
15/09/04 18:19:15 INFO TaskSetManager: Starting task 179.0 in stage 1.0 (TID 183, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:15 INFO Executor: Running task 179.0 in stage 1.0 (TID 183)
15/09/04 18:19:15 INFO TaskSetManager: Finished task 178.0 in stage 1.0 (TID 182) in 9829 ms on localhost (179/287)
15/09/04 18:19:15 INFO HadoopRDD: Input split: file:/S:/Staff_files/Mehmet/Projects/SPARK - scala/wos14.txt:6006243328+33554432
15/09/04 18:19:25 INFO Executor: Finished task 179.0 in stage 1.0 (TID 183). 2293 bytes result sent to driver
15/09/04 18:19:25 INFO TaskSetManager: Starting task 180.0 in stage 1.0 (TID 184, localhost, PROCESS_LOCAL, 1422 bytes)
15/09/04 18:19:25 INFO Executor: Running task 180.0 in stage 1.0 (TID 184)
...
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (13 times so far)
15/09/04 18:37:49 INFO BlockManagerInfo: Removed broadcast_2_piece0 on `localhost:64567 in memory (size: 2.2 KB, free: 969.8 MB)
15/09/04 18:37:49 INFO ExternalSorter: Thread 101 spilling in-memory map of 5.3 MB to disk (14 times so far)...
So is it reasonable to process such text files in local? After waiting more than 3 hours the program was still spilling data to the disk.
To sum up, is there something that I need to change in my code to cope with the performance issues?
Upvotes: 3
Views: 1475
Reputation: 28728
Are you giving Spark enough memory? It's not entirely obvious, but by default Spark starts with very small memory allocation. It won't use as much memory as it can eat like, say, an RDMS. You need to tell it how much you want it to use.
The default is (I believe) one executor per node, and 512MB of RAM per executor. You can scale this up very easily:
spark-shell --driver-memory 1G --executor-memory 1G --executor-cores 3 --num-executors 3
More settings here: http://spark.apache.org/docs/latest/configuration.html#application-properties
You can see how much memory is allocated to the Spark environment and each executor on the SparkUI, which (by default) is at http://localhost:4040
Upvotes: 1