Reputation: 28372
I would like to use CPython in a Hadoop Streaming job that needs access to supplementary information from a line-oriented file kept in HDFS. By "supplementary" I mean that this file is in addition to the data delivered via stdin. The supplementary file is too large to slurp into memory and split on end-of-line characters. Is there a particularly elegant way (or library) to process this file one line at a time?
Thanks,
SetJmp
Upvotes: 4
Views: 1197
Reputation: 39893
Check out the Hadoop Streaming documentation on using the Hadoop Distributed Cache in streaming jobs. You first upload the file to HDFS, then you tell Hadoop to replicate it to every node before running the job, and it conveniently places a symlink in the working directory of each task. You can then just use Python's open() and read the file with for line in f, or however you like.
The distributed cache is the most efficient way (out of the box) to push files around for a job to use as a resource. You do not want to simply open the HDFS file from your process, because each task would then stream the file over the network; with the distributed cache, a single copy is downloaded per node even if several tasks run on the same node.
First, add -files hdfs://NN:9000/user/sup.txt#sup.txt to your command-line arguments when you run the job.
Then:
for line in open('sup.txt'):
    # do stuff with each line
    pass
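Putting the pieces together, here is a minimal mapper sketch. It assumes nothing about either file's format: it simply streams sup.txt (the symlink created by the -files option above) one line at a time, then echoes each stdin key with the supplementary line count, purely to show where the two inputs plug in.

#!/usr/bin/env python
# Hypothetical mapper: sup.txt is the distributed-cache symlink requested
# with -files; stdin carries the normal Hadoop Streaming input.
import sys

# Stream the supplementary file one line at a time; nothing is slurped into memory.
sup_count = 0
for sup_line in open('sup.txt'):
    sup_count += 1  # stand-in for whatever per-line work you actually need

# Process the regular streaming input from stdin as usual.
for line in sys.stdin:
    key = line.rstrip('\n')
    # Emit tab-separated key/value pairs, the usual streaming convention.
    print('%s\t%d' % (key, sup_count))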
Upvotes: 3
Reputation: 391854
Are you looking for this?
http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#module-pydoop.hdfs
import pydoop.hdfs

with pydoop.hdfs.open("supplementary", "r") as supplementary:
    for line in supplementary:
        # process line
        pass
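The same pattern works with an explicit location; the hdfs://NN:9000/user/sup.txt path below is just the example path reused from the answer above. Keep in mind the trade-off noted there: reading through pydoop streams the file over the network from every task, while the distributed cache downloads one copy per node.

import pydoop.hdfs as hdfs

# Example path only; pydoop's hdfs.open also accepts full hdfs:// URIs.
with hdfs.open("hdfs://NN:9000/user/sup.txt", "r") as supplementary:
    for line in supplementary:
        pass  # handle one line at a time; the file is never read whole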
Upvotes: 1