Reputation: 241
I have been seeing an intense amount of disk usage on HDFS over the last 10 days. According to the DataNode hosts on the Hosts tab in Cloudera Manager and the Disk Usage charts for the HDFS service, usage has almost tripled, from ~7 TB to ~20 TB. At first I thought the cause was something I did wrong in the CM and CDH upgrade I performed on the 6th of those 10 days, but then I realized the growth had started before that.
I first checked the File Browser in Cloudera Manager, but saw no difference between the size numbers there and before. I also have disk usage reports for the last 4 days, and they show no increase.
Running hdfs dfsadmin -report
also returns the same.
The dfs folders on the Linux filesystem confirm the increasing usage, but I can't tell what has changed, because there are millions of files and I don't know how to find the most recently modified files across thousands of nested folders. Even if I found them, I couldn't tell which HDFS files they correspond to.
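On the HDFS side (rather than the local dfs folders), the output of hdfs dfs -ls -R includes a modification timestamp for every entry, so you can sort it externally to surface what changed most recently. A minimal sketch of that parsing, assuming the standard 8-column ls output; the path "/" in the usage comment is a placeholder for whatever tree you want to inspect:

```python
def newest_entries(ls_lines, n=20):
    """Sort `hdfs dfs -ls -R` output lines by their date/time columns
    (fields 6 and 7, e.g. '2017-05-04 13:22') and return the newest n
    as (timestamp, path) pairs."""
    entries = []
    for line in ls_lines:
        # Split into at most 8 fields so paths containing spaces stay intact:
        # permissions, replication, owner, group, size, date, time, path
        fields = line.split(None, 7)
        if len(fields) == 8:
            date, time, path = fields[5], fields[6], fields[7]
            entries.append((date + " " + time, path))
    entries.sort(reverse=True)  # newest first (ISO-like timestamps sort lexically)
    return entries[:n]

# Usage against a live cluster (path "/" is a placeholder; recursing from
# the root can be slow with millions of files, so narrow it if you can):
#   import subprocess
#   out = subprocess.check_output(["hdfs", "dfs", "-ls", "-R", "/"], text=True)
#   for ts, path in newest_entries(out.splitlines()):
#       print(ts, path)
```

Note that the plain Apache Hadoop shell's hdfs dfs -find only filters by name, not by modification time, which is why sorting ls output externally is the usual workaround.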
Then, just recently, I was informed that another user on HDFS has been splitting their large files. They own nearly 2/3 of all the data. Could it cause this much of an increase if they split those files into many more pieces, each smaller than the HDFS block size? If so, why can't I see it in the Browser/Reports?
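For reference, HDFS does not pad the last (or only) block of a file: a file smaller than the block size consumes only its actual bytes, times the replication factor, on the DataNodes. So splitting files multiplies the number of block objects (NameNode metadata pressure), not the bytes on disk. A back-of-the-envelope sketch; the 128 MB block size and replication factor 3 are common defaults assumed here, not taken from your cluster:

```python
# Rough model of HDFS usage: blocks are not padded, so splitting a file
# into many sub-block files changes the block count, not the bytes on disk.
BLOCK_SIZE = 128 * 1024 * 1024   # assumed default block size
REPLICATION = 3                  # assumed default replication factor

def physical_bytes(logical_bytes, replication=REPLICATION):
    """DataNode disk consumed: actual bytes times replication, no padding."""
    return logical_bytes * replication

def block_count(logical_bytes, block_size=BLOCK_SIZE):
    """NameNode-side block objects: ceil(size / block_size), minimum 1."""
    return max(1, -(-logical_bytes // block_size))

one_big = 10 * BLOCK_SIZE                  # one file spanning 10 full blocks
many_small = [BLOCK_SIZE // 2] * 20        # same data split into 20 half-block files

print(physical_bytes(one_big))             # same total disk usage...
print(physical_bytes(sum(many_small)))     # ...as the split version
print(block_count(one_big))                # 10 block objects
print(sum(block_count(s) for s in many_small))  # 20 block objects, same disk
```

If disk usage really tripled while the File Browser's (logical, pre-replication) sizes stayed flat, a change in replication factor or data outside HDFS's view would be more consistent with that than file splitting.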
Is there any way to check which folders and files have been modified recently in HDFS, or anything else I can check or do? Any suggestion or comment is appreciated.
Upvotes: 1
Views: 1188
Reputation: 850
For checking HDFS activity, Cloudera Navigator provides excellent information about all the events logged in HDFS.
After logging into Navigator, check the Audits tab. It lets you filter activities by operation (such as delete), IP address, username, and many other attributes.
The normal search page also lets you filter by block size (e.g. < 256 MB or > 256 MB), whether the entry is a file or a directory, the source type, the path, the replication count, and much more.
Upvotes: 0