Reputation: 173
I have an HDFS server to which I am currently streaming data.
I also hit this server regularly with the following type of command to check for certain conditions: hdfs dfs -find /user/cdh/streameddata/ -name *_processed
However, I have started to see this command taking a massive share of my CPU when monitoring it in top:
cdh 16919 1 99 13:03 ? 00:43:45 /opt/jdk/bin/java -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop -Dhadoop.id.str=cdh -Dhadoop.root.logger=ERROR,DRFA -Djava.library.path=/opt/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.fs.FsShell -find /user/cdh/streameddata/ -name *_processed
This is causing other applications to stall and is having a significant impact on my application as a whole.
My server has 48 cores, so I did not expect this to be an issue.
Currently, I have not set any additional heap in Hadoop, so it is using the 1000 MB default.
Upvotes: 0
Views: 884
Reputation: 5947
If you think your heap is probably too small, you can run:
jstat -gcutil 16919   # process ID of the hdfs dfs -find command
and look at the value under GCT (garbage collection time) to see how much time you are spending in garbage collection relative to your total run time.
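If GCT turns out to be dominant, one option is to give the FsShell client more heap. A minimal sketch, assuming a standard client setup where HADOOP_CLIENT_OPTS controls the heap of hdfs dfs commands (note that the command line you pasted contains both -Xmx1000m and -Xmx512m; the last -Xmx generally wins, so the effective heap may only be 512 MB):
export HADOOP_CLIENT_OPTS="-Xmx2g"                           # raise the client heap for this shell session only
hdfs dfs -find /user/cdh/streameddata/ -name '*_processed'   # quoting the pattern so the local shell does not expand it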
However, if the directory /user/cdh/streameddata/ contains hundreds of thousands or millions of files, you may legitimately be crippling your system.
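As a quick sanity check on how big that directory actually is (standard HDFS shell command, using the path from your question):
hdfs dfs -count /user/cdh/streameddata/
# columns: DIR_COUNT  FILE_COUNT  CONTENT_SIZE  PATHNAME
If FILE_COUNT is in the millions, the find itself will be expensive no matter how much heap you give the client.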
Upvotes: 1