I am a beginner and want to learn about Spark. I am working with spark-shell, running some experiments, and to get results faster I want the work to run on the Spark worker nodes.
I have two machines in total: the driver and one worker run on one machine, and a second worker runs on the other machine.
When I run a count, the result does not come from both nodes. I have a JSON file to read and am doing some performance checking.
Here is the code:
spark-shell --conf spark.sql.warehouse.dir=C:\spark-warehouse --master spark://192.168.0.31:7077

// sc is the SparkContext that spark-shell provides
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// read a JSON file from the local filesystem of each machine
val dfs = sqlContext.read.json("file:///C:/order.json")
dfs.count
The order.json file is present on both machines, but I still get inconsistent output.
Upvotes: 3
Views: 132
Reputation: 1214
1. If you are running Spark across multiple nodes, you need a shared data source such as S3 or HDFS, and every node must be able to access it. With a file:/// path, each executor reads its own local copy, so differing copies give differing results.
val dfs = sqlContext.read.json("file:///C:/order.json")
Change it to an HDFS path that every node can reach (the host and port below are placeholders for your NameNode):
val dfs = sqlContext.read.json("hdfs://<namenode-host>:<port>/order.json")
2. If your data source is fairly small, you can use a Spark broadcast variable to share the data with the other nodes, so that every node sees consistent data: https://spark.apache.org/docs/latest/rdd-programming-guide.html#shared-variables
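For example, here is a minimal sketch of a broadcast variable; the lookup map and its contents are made up for illustration, and sc is the SparkContext from spark-shell:

val smallLookup = Map("orderA" -> 1, "orderB" -> 2)
// ship one read-only copy of the map to every executor
val broadcastLookup = sc.broadcast(smallLookup)
// every task reads the same broadcast value, so all nodes see consistent data
val ids = sc.parallelize(Seq("orderA", "orderB", "orderA"))
val total = ids.map(id => broadcastLookup.value.getOrElse(id, 0)).sum()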
3. To print your logs to the console, configure the log4j file in your Spark conf folder. For details, see Override Spark log4j configurations.
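For example, a sketch based on the log4j 1.x template that Spark ships (copy conf/log4j.properties.template to conf/log4j.properties and adjust the level to taste):

# log INFO and above to the console
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n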
Upvotes: 2