Dejan Grubisic

Reputation: 3

Putting file into HDFS using docker-compose

Is there a way to put a file, say data.json, into HDFS automatically right from docker-compose/Dockerfile? When I start the namenode and datanode, I can enter the containers with

docker exec -it namenode [datanode] bash, and use

hdfs dfs -put data.json hdfs:/ (once safe mode is off)

and that works, but I need a way to run this automatically. When I try to build the container from a Dockerfile with the following commands:

FROM bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8
WORKDIR /data
ADD hdfs_writer/data.json /data
# ADD python_script.py /data

CMD ["hdfs dfsadmin -safemode wait && hdfs dfs -put ./data.json hdfs:/"]

# CMD ["python python_script.py"]

The namenode container terminates immediately. I also tried a Python script, which I add to the container and run with CMD.

python_script.py

import time
import os

# block until the namenode reports it has left safe mode
os.system("hdfs dfsadmin -safemode wait")
# copy the file into the HDFS root, overwriting it if it already exists
os.system("hdfs dfs -put -f data.json hdfs:/")

# keep the container alive afterwards
while True:
    time.sleep(5)

In that case the container keeps running, but if I check the logs and try to list HDFS with hdfs dfs -ls hdfs:/, I get the following error:

safemode: Call From 662aae005e8b/172.20.0.5 to namenode:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
19/04/18 14:36:36 WARN ipc.Client: Failed to connect to server: namenode/172.20.0.5:8020: try once and fail.

I read the link recommended in the error log, and to be honest, I am not sure I understand what I should do.

Any suggestions or ideas about a possible solution would be highly valuable to me, as I am new to this field and don't have much experience.
If you need more info, I will be happy to provide it.

docker-compose.yml (just part of it)

  namenode:
    #docker-compose.yml and Dockerfile are in the same directory
    build: .                    
    volumes:
      - ./data/namenode:/hadoop/dfs/name
    environment:
      - CLUSTER_NAME=cluster
    env_file:
      - ./hadoop.env
    ports:
      - 50070:50070
  datanode:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.8-java8
    depends_on: 
      - namenode
    volumes:
      - ./data/datanode:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

hadoop.env

CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*

HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_blocksize=1m

YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031

Upvotes: 0

Views: 2788

Answers (1)

David Maze

Reputation: 158848

You can't write to networked services in a Dockerfile. Imagine running docker build, running your combined application, tearing it down, and running it again. You'll reuse the same built image without re-running the Dockerfile steps; only the content in the image itself is kept. In most cases you need some minor amount of setup to communicate between services (Docker Compose can do this for you) but that is not set up during a build sequence. This is the same answer as "you can't run database migrations from a Dockerfile", but it applies equally to Hadoop.

A container only does one thing. Your sample Dockerfile sets a different CMD that waits for the namenode to be running and sets it up, and that happens instead of starting the namenode process. A Docker container runs one main command and one main command only; there is no way to run a main command plus a side support script. The container you show would probably work, but you'd need to run it as a separate container alongside the namenode container.
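As a rough sketch, the Dockerfile for that separate container could be almost exactly the one you already have; the shell-form CMD below is my own adjustment so that the && chain actually runs through a shell, and it assumes the image's entrypoint keeps generating the Hadoop config from the hadoop.env variables:

FROM bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8
# any image that ships the HDFS client would do; this one is just what you already pull
WORKDIR /data
ADD hdfs_writer/data.json /data
# shell form, so the && chain is interpreted by a shell instead of being treated as one binary name
CMD hdfs dfsadmin -safemode wait && hdfs dfs -put -f /data/data.json hdfs:/

Both commands talk to the namenode over the network, so this image is purely a client; how it gets wired up next to the namenode is sketched further down.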

You don't need to be "in Docker" to access Docker-hosted services. You can use a Docker Compose ports: directive to make services visible to the host, at which point you can use ordinary clients to interact with them. The docker exec path is the equivalent of "I ssh to my server as root, and then...", which isn't how you normally deal with any service at all.
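For example, the web UI port 50070 is already published in your file; publishing the namenode RPC port as well (the 8020 mapping is the only addition) would look like:

  namenode:
    build: .
    ports:
      - 50070:50070
      - 8020:8020     # HDFS RPC, reachable from the host as hdfs://localhost:8020

From the host, assuming a Hadoop client is installed there, something like hdfs dfs -fs hdfs://localhost:8020 -ls / should then reach the namenode. Note that metadata operations only need the namenode; actually reading or writing file data also requires the datanodes to be reachable from the host, which takes extra wiring, so for bulk loading the separate-container approach above tends to be simpler.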

Your server containers should only run servers. In your example you're both trying to launch an HDFS namenode and also populate the server from the same container; you'd be better off having the namenode container only be the namenode and running the setup job from another container or from the host. (See the standard postgres image's entrypoint script for some idea of the gyrations required otherwise.)
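Concretely, the Compose side of that could look roughly like this; the hdfs-seed service name and the Dockerfile.seed file name are made up for the example, and the namenode goes back to the stock image so that it only runs the namenode:

  namenode:
    image: bde2020/hadoop-namenode:1.1.0-hadoop2.8-java8
    # ... volumes, environment, env_file and ports stay as in your file
  hdfs-seed:
    build:
      context: .
      dockerfile: Dockerfile.seed   # the seed Dockerfile sketched above
    depends_on:
      - namenode
      - datanode
    env_file:
      - ./hadoop.env                # so the client sees fs.defaultFS=hdfs://namenode:8020

The seed container runs the safe-mode wait plus the put and then exits, while the namenode and datanode keep running. depends_on only orders startup and doesn't wait for the namenode to be ready, which is what the hdfs dfsadmin -safemode wait is there for; in practice you may still want a small retry loop around it in case the client starts before the namenode is listening at all.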

Docker Compose isn't great for one-off jobs. Every time you run docker-compose up it will discover that your setup container isn't running and try to start it again. Other more powerful orchestrators could be a better fit; for example, a Kubernetes Job is a reasonable fit for what you're describing.
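For reference, a very rough sketch of the same one-off copy as a Kubernetes Job; the image name, the assumption that there is a Service called namenode, and the assumption that the image keeps generating its Hadoop config from CORE_CONF_* environment variables the way it does under Compose are all mine:

apiVersion: batch/v1
kind: Job
metadata:
  name: hdfs-seed
spec:
  backoffLimit: 4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: hdfs-seed
          # the seed image sketched above, pushed somewhere the cluster can pull it from
          image: your-registry/hdfs-seed:latest
          env:
            - name: CORE_CONF_fs_defaultFS
              value: "hdfs://namenode:8020"   # assumes a Service named "namenode" in this namespace

The Job runs to completion once and stays completed; re-applying the same manifests later doesn't re-run it, which is exactly the behaviour docker-compose up doesn't give you for one-off setup containers.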

Upvotes: 2
