Reputation: 5764
Data stored in Parquet format results in a directory with many small files on HDFS.
Is there a way to view how those files are replicated in HDFS (on which nodes)?
Thanks in advance.
Upvotes: 0
Views: 359
Reputation: 40380
If I understand your question correctly, you want to track which data blocks are on which data nodes, and that's not apache-spark specific.
You can use the hadoop fsck command as follows:
hadoop fsck <path> -files -blocks -locations
This will print out locations for every block in the specified path.
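For example, to inspect a directory Spark wrote (the path below is hypothetical; substitute your own Parquet output directory):

# Hypothetical path to a Parquet output directory written by Spark.
hadoop fsck /user/me/events.parquet -files -blocks -locations

For each file under the path, the output includes its blocks, and each block's line lists the datanodes (host:port) holding a replica of it. As a side note, if you only want the replication factor rather than the placement, it is also shown per file in the hadoop fs -ls output.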
Upvotes: 2