Tamar

Reputation: 175

What is the most efficient way to read lines from files stored on HDFS in Python?

I am trying to figure out how to read lines of data from multiple text files stored on an HDFS server, in Python. I need to parse each line and keep only part of the data, so I would prefer not to save the files locally.

I need a way to connect to the server, go over all the files in a specific folder, read every line of each file, and perform an action on each line (the action itself is irrelevant to this question).

Upvotes: 2

Views: 1294

Answers (2)

Dmitry Rubanovich

Reputation: 2627

The pythonic way to do it would be to use itertools.chain. Alternatively, you can write a small utility generator function that iterates over the files, then over the lines in each file, and yields one line at a time. Something like this:

def lines_in_files(files):
    # files: any iterable of file-like objects, fetched one at a time from the connection
    for f in files:
        for line in f:
            yield line

If your fetched file object doesn't support all the file methods (in particular, line-by-line iteration), wrap its contents in StringIO before looping over it with for line in f.
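For illustration, here is a minimal sketch of the itertools.chain variant combined with the StringIO wrapping. It assumes the contents of each file have already been fetched as a str or bytes object; the helper name as_line_iterable, the UTF-8 decoding, and the sample data are assumptions made for the example, not part of any HDFS client.

```python
import io
from itertools import chain


def as_line_iterable(contents):
    """Wrap already-fetched file contents so they can be iterated line by line."""
    if isinstance(contents, bytes):
        # Decoding as UTF-8 is an assumption about the files' encoding.
        contents = contents.decode("utf-8")
    return io.StringIO(contents)


def lines_in_files(file_contents):
    """Lazily yield every line from every file, one file after another."""
    return chain.from_iterable(as_line_iterable(c) for c in file_contents)


# Stand-in data; in practice each item would come from the HDFS client.
fetched = ["a,1\nb,2\n", b"c,3\n"]
first_fields = [line.split(",")[0] for line in lines_in_files(fetched)]
```

Because chain.from_iterable consumes a generator expression, each file is wrapped only when the previous one has been exhausted, so only one file's contents need to be wrapped at a time.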

Upvotes: 1

Jeff Hammerbacher

Reputation: 4236

The GitHub repository mentioned in the comments to the question, python-hdfs, queries HDFS from Python through libhdfs, the C interface to HDFS. Recently, WebHDFS was introduced, which provides a REST interface to HDFS. https://github.com/drelu/webhdfs-py is a Python client for WebHDFS, and is likely a better choice than python-hdfs.
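To give a rough idea of what the REST interface looks like, here is a sketch that uses only the standard library rather than webhdfs-py, whose API is not shown here. It builds WebHDFS URLs and turns the JSON returned by a LISTSTATUS request into one OPEN URL per file. The function names, host, port, and folder are placeholders chosen for the example; adjust them for your cluster.

```python
import json
import posixpath
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL for an absolute HDFS path."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


def open_urls_for_folder(host, port, folder, liststatus_body):
    """Turn a LISTSTATUS JSON response into one OPEN URL per regular file."""
    statuses = json.loads(liststatus_body)["FileStatuses"]["FileStatus"]
    return [
        webhdfs_url(host, port, posixpath.join(folder, s["pathSuffix"]), "OPEN")
        for s in statuses
        if s["type"] == "FILE"  # skip subdirectories
    ]


# Hand-written sample of a LISTSTATUS response body, for illustration only.
sample_body = json.dumps({"FileStatuses": {"FileStatus": [
    {"pathSuffix": "a.txt", "type": "FILE"},
    {"pathSuffix": "archive", "type": "DIRECTORY"},
]}})
urls = open_urls_for_folder("namenode", 50070, "/data/logs", sample_body)
```

Each resulting URL could then be opened with urllib.request.urlopen, and the response iterated line by line (as bytes), which avoids saving the files locally.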

Upvotes: 1
