AFoley

Reputation: 173

Issue with HDFS command taking 100% CPU

I have an HDFS server to which I am currently streaming data.

I also regularly hit this server with the following command to check for certain conditions: hdfs dfs -find /user/cdh/streameddata/ -name *_processed
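
(Side note: an unquoted *_processed can be expanded by the local shell before hdfs ever sees it, if a matching file happens to exist in the current working directory; quoting the pattern, as below, makes sure the matching happens on the HDFS side.)

 hdfs dfs -find /user/cdh/streameddata/ -name '*_processed'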

However, I have started to see this command taking a massive portion of my CPU when monitoring in top:

cdh      16919     1 99 13:03 ?        00:43:45 /opt/jdk/bin/java -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/var/log/hadoop -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop -Dhadoop.id.str=cdh -Dhadoop.root.logger=ERROR,DRFA -Djava.library.path=/opt/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Xmx512m -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.fs.FsShell -find /user/cdh/streameddata/ -name *_processed

This is causing other applications to stall and is having a massive impact on my application as a whole.

My server has 48 cores, so I did not expect this to be an issue.

Currently, I have not set any additional heap in Hadoop, so it is using the 1000 MB default.
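
(For reference: the ps line above actually shows a second -Xmx512m after -Xmx1000m, and with HotSpot the last -Xmx wins, so the client JVM is likely capped at 512 MB. On a stock install this typically comes from HADOOP_CLIENT_OPTS in hadoop-env.sh, so one sketch for raising the client heap, with 2g as an arbitrary example size, would be:)

 export HADOOP_CLIENT_OPTS="-Xmx2g"   # example size only; applies to client commands such as 'hdfs dfs'
 hdfs dfs -find /user/cdh/streameddata/ -name '*_processed'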

Upvotes: 0

Views: 884

Answers (1)

tk421

Reputation: 5947

If you think your heap is too small, you can run:

 jstat -gcutil 16919 # process ID of the hdfs dfs find command

And look at the value under GCT (Garbage Collection Time) to see how much time you're spending in garbage collection relative to your total run time.
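
If you want a trend rather than a single snapshot, jstat also takes an interval and a sample count:

 jstat -gcutil 16919 1000 10   # 10 samples, one second apart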

However, if the directory /user/cdh/streameddata/ has hundreds of thousands or millions of files, you probably are legitimately crippling your system.
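
A quick way to check the size is hdfs dfs -count, which prints the directory count, file count, and content size for a path:

 hdfs dfs -count /user/cdh/streameddata/   # columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME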

Upvotes: 1
