Reputation: 5042
Hadoop has the configuration parameter hadoop.tmp.dir which, per the documentation, is "A base for other temporary directories." I presume this path refers to the local file system.
I set this value to /mnt/hadoop-tmp/hadoop-${user.name}. After formatting the namenode and starting all services, I see exactly the same path created on HDFS.
Does this mean hadoop.tmp.dir refers to a temporary location on HDFS?
Upvotes: 28
Views: 69964
Reputation: 29347
hadoop.tmp.dir is Hadoop's base temporary directory. It is a local (non-HDFS) directory, and as of Hadoop 3.4.0 its default value in core-default.xml is:
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
Different processes and services use subfolders of hadoop.tmp.dir for their temporary data. The default configuration files that reference it can be located with:
# cd $HADOOP_HOME; grep -lr --include="*.xml" "hadoop.tmp.dir" .
share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
share/doc/hadoop/hadoop-project-dist/hadoop-common/core-default.xml
share/doc/hadoop/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
share/doc/hadoop/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
All properties that depend directly on hadoop.tmp.dir can be extracted with:
for f in $(grep -lr --include="*.xml" "hadoop.tmp.dir" $HADOOP_HOME); do
  basename $f
  xmllint --xpath '/configuration/property[contains(value,"hadoop.tmp.dir")]' $f
  echo
done
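On a running installation you can also query the effective value directly; a minimal check:
# effective value, with any -site.xml overrides applied
hdfs getconf -confKey hadoop.tmp.dir
hdfs getconf -confKey dfs.namenode.name.dir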
hdfs-default.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/name</value>
<description>Determines where on the local filesystem the DFS name node
should store the name table(fsimage). If this is a comma-delimited list
of directories then the name table is replicated in all of the
directories, for redundancy. </description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/data</value>
<description>Determines where on the local filesystem a DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices. The directories should be tagged
with corresponding storage types ([SSD]/[DISK]/[ARCHIVE]/[RAM_DISK]/[NVDIMM]) for HDFS
storage policies. The default storage type will be DISK if the directory does
not have a storage type tagged explicitly. Directories that do not exist will
be created if local filesystem permission allows.
</description>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file://${hadoop.tmp.dir}/dfs/namesecondary</value>
<description>Determines where on the local filesystem the DFS secondary
name node should store the temporary images to merge.
If this is a comma-delimited list of directories then the image is
replicated in all of the directories for redundancy.
</description>
</property>
core-default.xml
<property>
<name>io.seqfile.local.dir</name>
<value>${hadoop.tmp.dir}/io/local</value>
<description>The local directory where sequence file stores intermediate
data files during merge. May be a comma-separated list of
directories on different devices in order to spread disk i/o.
Directories that do not exist are ignored.
</description>
</property>
<property>
<name>fs.s3a.buffer.dir</name>
<value>${env.LOCAL_DIRS:-${hadoop.tmp.dir}}/s3a</value>
<description>Comma separated list of directories that will be used to buffer file
uploads to.
Yarn container path will be used as default value on yarn applications,
otherwise fall back to hadoop.tmp.dir
</description>
</property>
<property>
<name>fs.azure.buffer.dir</name>
<value>${env.LOCAL_DIRS:-${hadoop.tmp.dir}}/abfs</value>
<description>Directory path for buffer files needed to upload data blocks
in AbfsOutputStream.
Yarn container path will be used as default value on yarn applications,
otherwise fall back to hadoop.tmp.dir </description>
</property>
mapred-default.xml
<property>
<name>mapreduce.cluster.local.dir</name>
<value>${hadoop.tmp.dir}/mapred/local</value>
<description>
The local directory where MapReduce stores intermediate
data files. May be a comma-separated list of
directories on different devices in order to spread disk i/o.
Directories that do not exist are ignored.
</description>
</property>
<property>
<name>mapreduce.jobhistory.recovery.store.fs.uri</name>
<value>${hadoop.tmp.dir}/mapred/history/recoverystore</value>
<!--value>hdfs://localhost:9000/mapred/history/recoverystore</value-->
<description>The URI where history server state will be stored if
HistoryServerFileSystemStateStoreService is configured as the recovery
storage class.</description>
</property>
<property>
<name>mapreduce.jobhistory.recovery.store.leveldb.path</name>
<value>${hadoop.tmp.dir}/mapred/history/recoverystore</value>
<description>The URI where history server state will be stored if
HistoryServerLeveldbSystemStateStoreService is configured as the recovery
storage class.</description>
</property>
yarn-default.xml
<property>
<description>URI pointing to the location of the FileSystem path where
RM state will be stored. This must be supplied when using
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
as the value for yarn.resourcemanager.store.class</description>
<name>yarn.resourcemanager.fs.state-store.uri</name>
<value>${hadoop.tmp.dir}/yarn/system/rmstore</value>
<!--value>hdfs://localhost:9000/rmstore</value-->
</property>
<property>
<description>Local path where the RM state will be stored when using
org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore
as the value for yarn.resourcemanager.store.class</description>
<name>yarn.resourcemanager.leveldb-state-store.path</name>
<value>${hadoop.tmp.dir}/yarn/system/rmstore</value>
</property>
<property>
<description>List of directories to store localized files in. An
application's localized file directory will be found in:
${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
Individual containers' work directories, called container_${contid}, will
be subdirectories of this.
</description>
<name>yarn.nodemanager.local-dirs</name>
<value>${hadoop.tmp.dir}/nm-local-dir</value>
</property>
<property>
<description>The local filesystem directory in which the node manager will
store state when recovery is enabled.</description>
<name>yarn.nodemanager.recovery.dir</name>
<value>${hadoop.tmp.dir}/yarn-nm-recovery</value>
</property>
<property>
<description>Store file name for leveldb timeline store.</description>
<name>yarn.timeline-service.leveldb-timeline-store.path</name>
<value>${hadoop.tmp.dir}/yarn/timeline</value>
</property>
<property>
<description>Store file name for leveldb state store.</description>
<name>yarn.timeline-service.leveldb-state-store.path</name>
<value>${hadoop.tmp.dir}/yarn/timeline</value>
</property>
<property>
<description>
The storage path for LevelDB implementation of configuration store,
when yarn.scheduler.configuration.store.class is configured to be
"leveldb".
</description>
<name>yarn.scheduler.configuration.leveldb-store.path</name>
<value>${hadoop.tmp.dir}/yarn/system/confstore</value>
</property>
<property>
<description>
The file system directory to store the configuration files. The path
can be any format as long as it follows hadoop compatible schema,
for example value "file:///path/to/dir" means to store files on local
file system, value "hdfs:///path/to/dir" means to store files on HDFS.
If resource manager HA is enabled, recommended to use hdfs schema so
it works in fail-over scenario.
</description>
<name>yarn.scheduler.configuration.fs.path</name>
<value>file://${hadoop.tmp.dir}/yarn/system/schedconf</value>
</property>
In addition to that, there are second-level dependencies, such as dfs.namenode.checkpoint.edits.dir, which depends on dfs.namenode.checkpoint.dir:
<property>
<name>dfs.namenode.checkpoint.edits.dir</name>
<value>${dfs.namenode.checkpoint.dir}</value>
<description>Determines where on the local filesystem the DFS secondary
name node should store the temporary edits to merge.
If this is a comma-delimited list of directories then the edits are
replicated in all of the directories for redundancy.
Default value is same as dfs.namenode.checkpoint.dir
</description>
</property>
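The same xmllint approach can be reused to trace such second-level dependencies, for example (assuming the doc copy of hdfs-default.xml shipped under share/doc, as above):
xmllint --xpath '/configuration/property[contains(value,"dfs.namenode.checkpoint.dir")]' \
$HADOOP_HOME/share/doc/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml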
All of these default values can be overridden in the corresponding -site.xml files.
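For example, relocating the whole tree as in the question takes a single override in core-site.xml (a minimal sketch, reusing the questioner's /mnt path):
<property>
<name>hadoop.tmp.dir</name>
<value>/mnt/hadoop-tmp/hadoop-${user.name}</value>
</property>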
Upvotes: 0
Reputation: 9245
It's confusing, but hadoop.tmp.dir is used as the base for temporary directories both locally and in HDFS. The documentation isn't great, but mapred.system.dir is set by default to ${hadoop.tmp.dir}/mapred/system, and this defines the path on HDFS where the Map/Reduce framework stores system files.
If you don't want these tied together, you can edit your mapred-site.xml so that mapred.system.dir is defined as something not based on ${hadoop.tmp.dir}.
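For example (a sketch only; hdfs://namenode:9000 is a placeholder for your cluster's NameNode URI):
<property>
<name>mapred.system.dir</name>
<value>hdfs://namenode:9000/mapred/system</value>
</property>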
Upvotes: 35
Reputation: 5276
Let me add a bit more to kkrugler's answer:
There are three HDFS properties that contain hadoop.tmp.dir in their values:

dfs.name.dir: the directory where the namenode stores its metadata, with default value ${hadoop.tmp.dir}/dfs/name.
dfs.data.dir: the directory where HDFS data blocks are stored, with default value ${hadoop.tmp.dir}/dfs/data.
fs.checkpoint.dir: the directory where the secondary namenode stores its checkpoints, with default value ${hadoop.tmp.dir}/dfs/namesecondary.

This is why you saw /mnt/hadoop-tmp/hadoop-${user.name} in your HDFS after formatting the namenode.
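If you want HDFS storage decoupled from hadoop.tmp.dir, these three properties can be set explicitly in the corresponding -site.xml files. A sketch with placeholder /data paths (note these are the older, pre-2.x names of the dfs.namenode.name.dir, dfs.datanode.data.dir and dfs.namenode.checkpoint.dir properties quoted above):
<property>
<name>dfs.name.dir</name>
<value>/data/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/data/hadoop/dfs/data</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/data/hadoop/dfs/namesecondary</value>
</property>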
Upvotes: 30
Reputation: 13927
I had a look around for information on this one. The only thing I could come up with was this passage from the Amazon Elastic MapReduce Dev Guide:
In hadoop-site.xml, we set hadoop.tmp.dir to /mnt/var/lib/hadoop/tmp. /mnt is where we mount the "extra" EC2 volumes, which can contain a lot more data than the default volume. (The exact amount depends on instance type.) Hadoop's RunJar.java (the module that unpacks the input JARs) interprets hadoop.tmp.dir as a Hadoop file system path rather than a local path, so it writes to the path in HDFS instead of a local path. HDFS is mounted under /mnt (specifically /mnt/var/lib/hadoop/dfs/). So, you can write lots of data to it.
Upvotes: 3