VenVig

Reputation: 915

How to persist HDFS data in docker container

I have a docker image for hadoop. (in my case it is https://github.com/kiwenlau/hadoop-cluster-docker, but the question applies to any hadoop docker image)

I am running the docker container as below:

sudo docker run -itd --net=hadoop --user=root -p 50070:50070 \
-p 8088:8088 -p 9000:9000 --name hadoop-master --hostname hadoop-master \
kiwenlau/hadoop

I am writing data to the HDFS file system from Java running on the host Ubuntu machine.

FileSystem hdfs = FileSystem.get(new URI("hdfs://0.0.0.0:9000"), configuration);
hdfs.create(new Path("hdfs://0.0.0.0:9000/user/root/input/NewFile.txt"));

How should I mount the volume when starting Docker such that "NewFile.txt" is persisted?

Which "path" inside the container corresponds to the HDFS path "/user/root/input/NewFile.txt" ?

Upvotes: 2

Views: 5133

Answers (1)

OneCricketeer

Reputation: 191983

You should inspect the dfs.datanode.data.dir property in the hdfs-site.xml file to know where the data is stored in the container filesystem.

<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///root/hdfs/datanode</value>
    <description>DataNode directory</description>
</property>

Without this file/property, the default location would be in file:///tmp/hadoop-${user.name}/dfs/data

For Docker, mind that the default user that runs the processes is the root user.

You will also need to persist the namenode files, whose location is again set in that XML file.
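
For example, here is a minimal sketch of a docker run that bind-mounts both directories from the host. It assumes the image also sets dfs.namenode.name.dir to file:///root/hdfs/namenode (check your image's hdfs-site.xml), and the host paths are placeholders:

# Sketch only: container paths assume dfs.datanode.data.dir=file:///root/hdfs/datanode
# and dfs.namenode.name.dir=file:///root/hdfs/namenode in the image's hdfs-site.xml.
sudo docker run -itd --net=hadoop --user=root -p 50070:50070 \
-p 8088:8088 -p 9000:9000 --name hadoop-master --hostname hadoop-master \
-v /path/on/host/namenode:/root/hdfs/namenode \
-v /path/on/host/datanode:/root/hdfs/datanode \
kiwenlau/hadoop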

Which "path" inside the container corresponds to the HDFS path "/user/root/input/NewFile.txt"

The container path holds the blocks of the HDFS file, not the whole file itself.
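
For instance, you can look inside the running container to see those block files (the path is assumed from the hdfs-site.xml above):

# The datanode directory contains HDFS block files (blk_*), not NewFile.txt itself.
docker exec hadoop-master find /root/hdfs/datanode -name 'blk_*'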

Upvotes: 4
