ecer

Reputation: 75

Spark write to Parquet on HDFS

I have a 3-node cluster with Hadoop and Spark installed. I would like to load data from an RDBMS into a DataFrame and write that data as Parquet on HDFS. The "dfs.replication" value is 1.

When I try this with the following command, I see that all HDFS blocks end up on the node where I ran spark-shell:

scala> xfact.write.parquet("hdfs://sparknode01.localdomain:9000/xfact")
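For completeness, xfact is loaded from the RDBMS over JDBC roughly along these lines (a sketch only; the JDBC URL, table name, and credentials below are placeholders, not my real values):

// Sketch of the JDBC load; connection details are placeholders
val props = new java.util.Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")

val xfact = sqlContext.read.jdbc(
  "jdbc:postgresql://dbhost:5432/mydb",  // placeholder connection string
  "xfact",                               // placeholder source table
  props)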

Is this the intended behaviour or should all blocks be distributed across the cluster?

Thanks

Upvotes: 4

Views: 12524

Answers (2)

Carrod

Reputation: 1

Just as @nik says, I do my work with multiple clients, and this worked for me:

This is the Python snippet:

# Rebuild the DataFrame from its underlying RDD, then write it out as Parquet
columns = xfact.columns
test = sqlContext.createDataFrame(xfact.rdd.map(lambda a: a), columns)
test.write.mode('overwrite').parquet('hdfs://sparknode01.localdomain:9000/xfact')

Upvotes: 0

nik

Reputation: 2294

Since you are writing your data to HDFS, this does not depend on Spark but on HDFS. From Hadoop: The Definitive Guide:

Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy).

So yes, this is the intended behaviour.
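If you nevertheless want the data spread across the cluster, one option (a sketch, assuming your Spark executors run on the datanodes) is to have the write performed by tasks on several executors, for example by repartitioning first; with dfs.replication = 1 each block's single replica then lands on the node whose task wrote it:

// Sketch: spread the write across executors; the partition count is illustrative
xfact.repartition(12).write.mode("overwrite").parquet("hdfs://sparknode01.localdomain:9000/xfact")

With a higher replication factor the extra replicas are placed on other nodes by HDFS anyway (second replica off-rack, third on the same rack as the second), so the data spreads without any Spark-side changes.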

Upvotes: 3
