aka.ecc

Reputation: 193

How can I access the HDFS file system in the latest TensorFlow (2.6.0)?

I recently upgraded the TensorFlow version used in my program to the newly released 2.6.0, but ran into a problem.

import tensorflow as tf

pattern = 'hdfs://mypath'
print(tf.io.gfile.glob(pattern))

The above API throws an exception in version 2.6:

tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme 'hdfs' not implemented (file:xxxxx)

I then checked the relevant implementation and found that the officially recommended way to access HDFS is now tensorflow/io, and that the environment variable TF_USE_MODULAR_FILESYSTEM is provided to keep using the legacy support. Since my code is fairly complex and hard to refactor quickly, I tried setting this environment variable, but the call still failed.

In short, my questions are:

  1. In the latest version of TensorFlow, can I still access HDFS files without using "tfio"? If so, how?
  2. If "tfio" must be used, what is the equivalent of tf.io.gfile.glob?

Upvotes: 1

Views: 3756

Answers (1)

aka.ecc

Reputation: 193

TL;DR: Install tensorflow-io and import it.

After some trial and error, I found a solution (it may be the officially recommended way):

Since v2.6.0, TensorFlow no longer ships HDFS, GCS and other file system support in the core framework; that support has moved to the tensorflow/io project.

Therefore, from this version on, to get support for HDFS, GCS and other file systems you only need to install tensorflow-io and import it in the training program:

$ pip install tensorflow-io

$ cat test.py
import tensorflow as tf
import tensorflow_io as tfio

print(tf.io.gfile.glob('hdfs://...'))

$ CLASSPATH=$(${HADOOP_HOME}/bin/hadoop classpath --glob) python test.py
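
If wrapping the launch command is inconvenient, the same CLASSPATH setup can probably be done from inside Python. This is a minimal sketch, not an official recipe: the helper names hadoop_classpath_cmd and export_hadoop_classpath are hypothetical, and it assumes that HADOOP_HOME is set and that putting CLASSPATH into the environment before the first HDFS access is early enough.

```python
import os
import subprocess


def hadoop_classpath_cmd(hadoop_home):
    """Build the same command the shell line above runs (hypothetical helper)."""
    return [os.path.join(hadoop_home, "bin", "hadoop"), "classpath", "--glob"]


def export_hadoop_classpath(env=None):
    """Run `hadoop classpath --glob` and store its output in CLASSPATH.

    Assumption: HADOOP_HOME is set, and setting CLASSPATH before the first
    HDFS call is early enough for the HDFS plugin to pick it up.
    """
    env = os.environ if env is None else env
    cmd = hadoop_classpath_cmd(env["HADOOP_HOME"])
    env["CLASSPATH"] = subprocess.check_output(cmd, text=True).strip()
    return env["CLASSPATH"]
```

Call export_hadoop_classpath() at the top of the program, before touching any hdfs:// path. If that turns out to be too late in your environment, fall back to setting CLASSPATH in the shell as shown above.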

Importing it loads libtensorflow_io.so and libtensorflow_io_plugins.so, which contain the implementation and registration logic for each extra file system:

# tensorflow_io/python/ops/__init__.py
core_ops = LazyLoader("core_ops", "libtensorflow_io.so")
try:
    plugin_ops = _load_library("libtensorflow_io_plugins.so", "fs")
except NotImplementedError as e:
    warnings.warn("unable to load libtensorflow_io_plugins.so: {}".format(e))
    # Note: load libtensorflow_io.so imperatively in case of statically linking
    try:
        core_ops = _load_library("libtensorflow_io.so")
        plugin_ops = _load_library("libtensorflow_io.so", "fs")
    except NotImplementedError as e:
        warnings.warn("file system plugins are not loaded: {}".format(e))
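
Because the file systems are registered as a side effect of the import, tf.io.gfile.glob itself stays the equivalent call; only the import is new. For a large code base, that import can be confined to one wrapper. The following is a minimal sketch under stated assumptions: needs_tfio, glob_any and PLUGIN_SCHEMES are hypothetical names, and the scheme set lists only hdfs, the one scheme exercised in this post.

```python
import importlib
from urllib.parse import urlparse

# Assumption: schemes that need the tensorflow-io plugins on TF >= 2.6.0.
# Only "hdfs" is listed because it is the scheme tested in this post.
PLUGIN_SCHEMES = {"hdfs"}


def needs_tfio(path):
    """Return True if the path uses a scheme served by tensorflow-io."""
    return urlparse(path).scheme in PLUGIN_SCHEMES


def glob_any(pattern):
    """Glob a local or HDFS pattern with the unchanged tf.io.gfile API."""
    if needs_tfio(pattern):
        # Imported only for its side effect: loading the file system plugins.
        importlib.import_module("tensorflow_io")
    import tensorflow as tf  # deferred so the helper module imports cheaply

    return tf.io.gfile.glob(pattern)
```

Existing calls such as tf.io.gfile.glob('hdfs://...') can then be routed through glob_any('hdfs://...'), or simply left as they are once tensorflow_io is imported anywhere in the program.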

Upvotes: 2
