Reputation: 5764
Data stored in Parquet format results in a directory with many small files on HDFS.
Is there a way to view how those files are replicated in HDFS (on which nodes)?
Thanks in advance.
Upvotes: 0
Views: 359
Reputation: 40380
If I understand your question correctly, you want to track which data blocks are on which data nodes, and that's not apache-spark specific.
You can use the hadoop fsck command as follows:
hadoop fsck <path> -files -blocks -locations
This will print out locations for every block in the specified path.
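For example, to inspect a directory Spark wrote (the path below is hypothetical; substitute your own Parquet output directory):

# Hypothetical path to a Parquet output directory written by Spark.
hadoop fsck /user/me/events.parquet -files -blocks -locations

For each file under the path, the output includes its blocks, and each block's line lists the datanodes (host:port) holding a replica of it. As a side note, if you only want the replication factor rather than the placement, it is also shown per file in the hadoop fs -ls output.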
Upvotes: 2