user3243499

Reputation: 3151

Why does _spark_metadata list all partitioned parquet files inside 0 when the cluster has 2 workers?

I have a small Spark cluster with one master and two workers. I have a streaming app that reads data from Kafka and writes it to a directory in parquet format, in append mode.

So far I am able to read from the Kafka stream and write it out as parquet using the following key line.

val streamingQuery = mydf.writeStream.format("parquet").option("path", "/root/Desktop/sampleDir/myParquet").outputMode(OutputMode.Append).option("checkpointLocation", "/root/Desktop/sampleDir/myCheckPoint").start()

I have checked on both workers. Each has 3-4 snappy parquet files, with names like part-00006-XXX.snappy.parquet.

But when I try to read the parquet output using the following command:

val dfP = sqlContext.read.parquet("/root/Desktop/sampleDir/myParquet")

it throws file-not-found exceptions for some of the parquet part files. The strange thing is that those files are present on one of the worker nodes.

On checking the logs further, I observed that Spark is trying to read all the parquet files from only ONE worker node. Since not all the parquet files are present on that worker, it fails with an exception saying those files were not found at the given parquet path.

Am I missing some critical step in the streaming query or while reading data?

NOTE: I don't have Hadoop infrastructure. I want to use the local filesystem only.

Upvotes: 0

Views: 536

Answers (1)

Assaf Mendelson

Reputation: 13001

You need a shared file system.

Spark assumes the same file system is visible from all nodes (driver and workers). If you are using the plain local file system, each node sees only its own disk, which is different from the disks of the other nodes. Each worker therefore writes its part files locally, and a later read cannot find the files that were written on another node.

HDFS is one way of getting a common, shared file system. Another is a common NFS mount (i.e. mount the same remote file system from all nodes at the same path). Other shared file systems also exist.
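
As a rough sketch of the NFS option, the streaming query from the question could point both the output path and the checkpoint location at a directory on the shared mount. The mount point /mnt/shared, the broker address localhost:9092 and the topic name myTopic are placeholders I made up for illustration, and the Kafka source assumes the spark-sql-kafka connector is on the classpath; adjust all of them to your setup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

object SharedPathSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shared-path-sketch")
      .getOrCreate()

    // Assumption: /mnt/shared is the SAME remote file system (e.g. an NFS export)
    // mounted at the SAME path on the driver and on every worker.
    val base = "file:///mnt/shared/sampleDir"

    // Assumption: broker address and topic name are placeholders.
    val mydf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "myTopic")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // Same write as in the question, but both paths now live on the shared mount,
    // so every executor writes into one directory that all nodes can see.
    val streamingQuery = mydf.writeStream
      .format("parquet")
      .option("path", s"$base/myParquet")
      .option("checkpointLocation", s"$base/myCheckPoint")
      .outputMode(OutputMode.Append)
      .start()

    streamingQuery.awaitTermination()
  }
}
```

Reading the data back from a separate session would then use the same shared path, e.g. spark.read.parquet("file:///mnt/shared/sampleDir/myParquet"), and every part file should be visible regardless of which worker wrote it.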

Upvotes: 1
