user8443296

Reputation:

Apache Spark is not giving correct output

I am a beginner and want to learn about Spark. I am experimenting with spark-shell, and to get fast results I want to fetch the results from the Spark worker nodes.

I have two machines in total: the driver and one worker run on one machine, and another worker runs on the other machine.

When I run the count, the result does not come from both nodes. I am reading a JSON file and doing some performance checking.

Here is the code:

spark-shell --conf spark.sql.warehouse.dir=C:\spark-warehouse --master spark://192.168.0.31:7077

// Create a SQLContext, read the JSON file from the local filesystem, and count the rows
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfs = sqlContext.read.json("file:///C:/order.json")
dfs.count

The order.json file is distributed on both machines, but I am still getting different output.

Upvotes: 3

Views: 132

Answers (1)

SharpLu

Reputation: 1214

1. If you are running Spark on multiple nodes, you need a shared storage path such as S3 or HDFS; make sure each node can access your data source.

val dfs = sqlContext.read.json("file:///C:/order.json")

Change it to:

val dfs = sqlContext.read.json("HDFS://order.json")

2. If your data source is fairly small, you can try using a Spark broadcast variable to share the data with the other nodes, so that each node has a consistent copy: https://spark.apache.org/docs/latest/rdd-programming-guide.html#shared-variables
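A minimal sketch of the broadcast approach, assuming the JSON is small enough to collect to the driver (variable names are illustrative):

// Collect the small dataset to the driver, then broadcast it to all executors
val rows = sqlContext.read.json("file:///C:/order.json").collect()
val broadcastRows = sc.broadcast(rows)

// Every task on every node now reads the same consistent copy
sc.parallelize(1 to 4).map(_ => broadcastRows.value.length).collect()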

3. To print your logs to the console, configure the log4j file in your Spark conf folder. For details, see Override Spark log4j configurations.
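For example, with the Log4j 1.x setup that ships with Spark, you can copy conf/log4j.properties.template to conf/log4j.properties and set the root logger level (INFO here is just an example):

# conf/log4j.properties
log4j.rootCategory=INFO, console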

Upvotes: 2
