DPEZ

Reputation: 117

How to list HDFS directory contents using webhdfs?

Is it possible to check the contents of a directory in HDFS using webhdfs?

This would work as hdfs dfs -ls normally would, but instead using webhdfs.

How can I list a directory over webhdfs using Python 2.6?

Upvotes: 5

Views: 7659

Answers (1)

Michael Gaskill

Reputation: 8042

You can use the LISTSTATUS operation. It's documented in the WebHDFS REST API reference under List a Directory, which is also the source of the following examples:

With curl, this is what it looks like:

curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"
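The same request can be issued from Python 2.6 with nothing but the standard library. This is a sketch, not tied to any real cluster; the host, port, and path below are placeholders you would replace with your own:

```python
# Build the LISTSTATUS request URL (host/port/path are hypothetical placeholders).
host = "namenode.example.com"   # your NameNode host
port = 50070                    # default WebHDFS port
path = "/user/hdfs"             # HDFS path to list (leading "/" included)

url = "http://{0}:{1}/webhdfs/v1{2}?op=LISTSTATUS".format(host, port, path)
print(url)

# On Python 2.6, the request itself can then be made with urllib2:
#   import urllib2
#   body = urllib2.urlopen(url).read()
```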

The response is a FileStatuses JSON object:

{
  "name"      : "FileStatuses",
  "properties":
  {
    "FileStatuses":
    {
      "type"      : "object",
      "properties":
      {
        "FileStatus":
        {
          "description": "An array of FileStatus",
          "type"       : "array",
          "items"      : fileStatusProperties
        }
      }
    }
  }
}

fileStatusProperties (for the items field) has this JSON schema:

var fileStatusProperties =
{
  "type"      : "object",
  "properties":
  {
    "accessTime":
    {
      "description": "The access time.",
      "type"       : "integer",
      "required"   : true
    },
    "blockSize":
    {
      "description": "The block size of a file.",
      "type"       : "integer",
      "required"   : true
    },
    "group":
    {
      "description": "The group owner.",
      "type"       : "string",
      "required"   : true
    },
    "length":
    {
      "description": "The number of bytes in a file.",
      "type"       : "integer",
      "required"   : true
    },
    "modificationTime":
    {
      "description": "The modification time.",
      "type"       : "integer",
      "required"   : true
    },
    "owner":
    {
      "description": "The user who is the owner.",
      "type"       : "string",
      "required"   : true
    },
    "pathSuffix":
    {
      "description": "The path suffix.",
      "type"       : "string",
      "required"   : true
    },
    "permission":
    {
      "description": "The permission represented as an octal string.",
      "type"       : "string",
      "required"   : true
    },
    "replication":
    {
      "description": "The replication factor of a file.",
      "type"       : "integer",
      "required"   : true
    },
    "type":
    {
      "description": "The type of the path object.",
      "enum"       : ["FILE", "DIRECTORY"],
      "required"   : true
    }
  }
};
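Given a response shaped like the schema above, the body can be decoded with the standard json module alone. The values in this sample payload are made up for illustration:

```python
import json

# A minimal sample LISTSTATUS response (field values are invented for illustration).
sample = '''
{
  "FileStatuses": {
    "FileStatus": [
      {"pathSuffix": "a.txt", "type": "FILE", "length": 24930,
       "owner": "hdfs", "group": "supergroup", "permission": "644",
       "accessTime": 1320171722771, "modificationTime": 1320171722771,
       "blockSize": 134217728, "replication": 3},
      {"pathSuffix": "bar", "type": "DIRECTORY", "length": 0,
       "owner": "hdfs", "group": "supergroup", "permission": "711",
       "accessTime": 0, "modificationTime": 1320895981256,
       "blockSize": 0, "replication": 0}
    ]
  }
}
'''

# Drill down to the FileStatus array and pull out the entry names.
statuses = json.loads(sample)["FileStatuses"]["FileStatus"]
names = [s["pathSuffix"] for s in statuses]
print(names)  # ['a.txt', 'bar']
```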

You can process the filenames in Python using pywebhdfs, like this:

from pprint import pprint
from pywebhdfs.webhdfs import PyWebHdfsClient

hdfs = PyWebHdfsClient(host='host', port='50070', user_name='hdfs')  # Use your own host/port/user_name config

data = hdfs.list_dir("dir/dir")  # Use your preferred directory, without the leading "/"

file_statuses = data["FileStatuses"]
pprint(file_statuses)   # Display the whole dict

for item in file_statuses["FileStatus"]:
    print(item["pathSuffix"])   # Display each item's filename

Instead of printing each object, you can work with the items however you need. The result in file_statuses is simply a Python dict, so it can be used like any other dict, provided that you use the right keys.
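For example, the type field from the schema above lets you separate directories from files. The file_statuses dict below is a hypothetical stand-in for the value returned by list_dir, so the snippet is self-contained:

```python
# Hypothetical sample standing in for hdfs.list_dir(...)["FileStatuses"].
file_statuses = {
    "FileStatus": [
        {"pathSuffix": "a.txt", "type": "FILE", "length": 24930},
        {"pathSuffix": "bar", "type": "DIRECTORY", "length": 0},
    ]
}

# Partition entries by the "type" field from the schema.
dirs = [s["pathSuffix"] for s in file_statuses["FileStatus"] if s["type"] == "DIRECTORY"]
files = [s["pathSuffix"] for s in file_statuses["FileStatus"] if s["type"] == "FILE"]

print(dirs)   # ['bar']
print(files)  # ['a.txt']
```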

Upvotes: 8
