Reputation: 175
I am trying to figure out a way to read lines of data from multiple text files stored on an HDFS server, in Python. I need to parse each line and keep only part of the data, so I would prefer not to save the files locally.
I need a way to connect to the server, go over all the files in a specific folder and, for each file, read all of its lines and perform an action on them (the action itself is irrelevant to this question).
Upvotes: 2
Views: 1294
Reputation: 2627
The pythonic way to do it would be to use itertools.chain. But you can write a small utility generator function which iterates over files and then iterates over lines in the files and yields one line at a time. Something like this:
def lines_in_files(connection):
    # fetch_files is a placeholder: substitute whatever call fetches
    # one file at a time from the connection
    for f in fetch_files(connection):
        for line in f:
            yield line
If your fetched file object doesn't support all the file methods, wrap its contents in StringIO before doing for line in f.
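For what it's worth, here is a minimal sketch of the itertools.chain variant combined with the StringIO wrapping. The names chain_lines, as_file and fetched are made up for the example, and the strings stand in for contents you would actually fetch from HDFS:

```python
import io
from itertools import chain


def chain_lines(files):
    """Lazily yield every line from an iterable of file-like objects."""
    return chain.from_iterable(files)


def as_file(contents):
    """Wrap an already-fetched string so it can be iterated line by line."""
    return io.StringIO(contents)


# Stand-ins for the contents of two files fetched from HDFS (assumption).
fetched = ["alpha,1\nbeta,2\n", "gamma,3\n"]

# Keep only the first comma-separated field of each line.
first_fields = [line.split(",")[0]
                for line in chain_lines(as_file(c) for c in fetched)]
print(first_fields)  # ['alpha', 'beta', 'gamma']
```

Since chain.from_iterable is lazy, each file is only opened when the previous one has been exhausted, so nothing needs to be saved locally.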
Upvotes: 1
Reputation: 4236
The GitHub repository mentioned in the comments to the question, python-hdfs, queries HDFS from Python through libhdfs, the C interface to HDFS. Recently, WebHDFS was introduced, which provides a REST interface to HDFS. https://github.com/drelu/webhdfs-py is a Python client for WebHDFS, and is likely a better choice than python-hdfs.
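If you'd rather not depend on a client library, the WebHDFS REST calls can also be made directly with the standard library. The following is only a rough sketch, not how webhdfs-py does it: it assumes WebHDFS is enabled on the namenode, that no authentication is needed (some clusters will want a user.name query parameter), and that the files are UTF-8 text. The function names and the base URL in the usage note are invented for the example; LISTSTATUS and OPEN are the actual WebHDFS operations.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def webhdfs_url(base, path, op):
    """Build a WebHDFS REST URL of the form <base>/webhdfs/v1<path>?op=<op>."""
    return "{}/webhdfs/v1{}?op={}".format(base.rstrip("/"), quote(path), op)


def file_names(liststatus_json):
    """Extract the names of plain files from a LISTSTATUS response body."""
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    return [s["pathSuffix"] for s in statuses if s["type"] == "FILE"]


def lines_in_folder(base, folder, opener=urlopen):
    """Yield decoded lines from every file in an HDFS folder, one at a time."""
    with opener(webhdfs_url(base, folder, "LISTSTATUS")) as resp:
        names = file_names(resp.read().decode("utf-8"))
    for name in names:
        path = folder.rstrip("/") + "/" + name
        # OPEN answers with a redirect to a datanode; urlopen follows it.
        with opener(webhdfs_url(base, path, "OPEN")) as resp:
            for raw in resp:
                yield raw.decode("utf-8")
```

Usage would look something like for line in lines_in_folder("http://namenode:50070", "/user/me/data"): ..., where the host, port and folder are placeholders for your own cluster. The opener parameter is only there so the function can be tried without a live server.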
Upvotes: 1