VenVig

Reputation: 915

How to persist HDFS data in docker container

I have a docker image for hadoop. (in my case it is https://github.com/kiwenlau/hadoop-cluster-docker, but the question applies to any hadoop docker image)

I am running the docker container as below:

sudo docker run -itd --net=hadoop --user=root -p 50070:50070 \
-p 8088:8088 -p 9000:9000 --name hadoop-master --hostname hadoop-master \
kiwenlau/hadoop

I am writing data to the HDFS file system from Java running on the host Ubuntu machine.

FileSystem hdfs = FileSystem.get(new URI("hdfs://0.0.0.0:9000"), configuration);
hdfs.create(new Path("hdfs://0.0.0.0:9000/user/root/input/NewFile.txt"));

How should I mount the volume when starting Docker such that "NewFile.txt" is persisted?

Which "path" inside the container corresponds to the HDFS path "/user/root/input/NewFile.txt" ?

Upvotes: 2

Views: 5133

Answers (1)

OneCricketeer

Reputation: 191983

You should inspect the dfs.datanode.data.dir property in the hdfs-site.xml file to know where the data is stored in the container filesystem.

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///root/hdfs/datanode</value>
    <description>DataNode directory</description>
</property>

Without this file/property, the default location would be in file:///tmp/hadoop-${user.name}/dfs/data

For Docker, mind that the default user that runs the processes is the root user.

You will also need to persist the namenode files, whose location is again set in that XML file.
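
For example, here is a minimal sketch of a docker run that bind-mounts both directories from the host. It assumes the image also sets dfs.namenode.name.dir to file:///root/hdfs/namenode (check your image's hdfs-site.xml), and the host paths are placeholders:

# Sketch only: container paths assume dfs.datanode.data.dir=file:///root/hdfs/datanode
# and dfs.namenode.name.dir=file:///root/hdfs/namenode in the image's hdfs-site.xml.
sudo docker run -itd --net=hadoop --user=root -p 50070:50070 \
-p 8088:8088 -p 9000:9000 --name hadoop-master --hostname hadoop-master \
-v /path/on/host/namenode:/root/hdfs/namenode \
-v /path/on/host/datanode:/root/hdfs/datanode \
kiwenlau/hadoop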

Which "path" inside the container corresponds to the HDFS path "/user/root/input/NewFile.txt"

The container path holds the blocks of the HDFS file, not the whole file itself.
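
For instance, you can look inside the running container to see those block files (the path is assumed from the hdfs-site.xml above):

# The datanode directory contains HDFS block files (blk_*), not NewFile.txt itself.
docker exec hadoop-master find /root/hdfs/datanode -name 'blk_*'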

Upvotes: 4
