Charles Menguy

Reputation: 41428

Python read file as stream from HDFS

Here is my problem: I have a file in HDFS which can potentially be huge (i.e. too big to fit entirely in memory).

What I would like to do is avoid having to cache this file in memory, and only process it line by line as I would with a regular file:

for line in open("myfile", "r"):
    # do some processing

I am looking to see if there is an easy way to get this done right without using external libraries. I can probably make it work with libpyhdfs or python-hdfs, but I'd like, if possible, to avoid introducing new dependencies and untested libraries into the system, especially since both of these don't seem heavily maintained and state that they shouldn't be used in production.

I was thinking of doing this with the standard "hadoop" command line tools and the Python subprocess module, but I can't seem to do what I need since there are no command line tools that would do my processing, and I would like to execute a Python function for every line in a streaming fashion.

Is there a way to apply Python functions as right operands of the pipes using the subprocess module? Or even better, open it like a file as a generator so I could process each line easily?

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
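
Essentially I am after something like the sketch below, where stream_hdfs_lines is just a hypothetical helper wrapping that pipe into a generator:

import subprocess

def stream_hdfs_lines(path):
    # hypothetical helper: pipe "hadoop fs -cat" and yield one line at a time
    cat = subprocess.Popen(["hadoop", "fs", "-cat", path], stdout=subprocess.PIPE)
    for line in cat.stdout:
        yield line

for line in stream_hdfs_lines("/path/to/myfile"):
    # do some processing
    pass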

If there is another way to achieve what I described above without using an external library, I'm also pretty open.

Thanks for any help!

Upvotes: 32

Views: 89200

Answers (5)

SB1990

Reputation: 43

Below worked for me.

If you want to read a file from Hadoop, you can use the program below, and you need to have the corresponding property set in hdfs-site.xml.

Upvotes: 0

Ramsha Siddiqui

Reputation: 480

You can use the WebHDFS Python Library (built on top of urllib3):

from json import dump  # or pickle.dump
from hdfs import InsecureClient

client_hdfs = InsecureClient('http://host:port', user='root')
with client_hdfs.write(access_path) as writer:  # access_path: destination path on HDFS
    dump(records, writer)  # tested for pickle and json (doesn't work for joblib)
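
Since the question is about reading, the same library can also stream a file line by line. A sketch, assuming the hdfs (HdfsCLI) package and a reachable WebHDFS endpoint, with host, port and path as placeholders:

from hdfs import InsecureClient

client_hdfs = InsecureClient('http://host:port', user='root')
# delimiter='\n' makes the reader yield one line at a time instead of the whole file
with client_hdfs.read('/path/to/myfile', encoding='utf-8', delimiter='\n') as reader:
    for line in reader:
        pass  # do some processing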

Or you can use the requests package in Python:

import requests
from json import dumps

params = (('op', 'CREATE'), ('buffersize', 256))
data = dumps(file)  # some file or object - also tested for the pickle library
response = requests.put('http://host:port/path', params=params, data=data)  # response 200 = successful
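
For the read side with requests, the WebHDFS OPEN operation can be streamed so the file is never fully loaded into memory. A sketch, with host, port and the file path as placeholders, assuming WebHDFS is enabled on the cluster:

import requests

# OPEN streams the file contents; stream=True avoids buffering it all in memory
url = 'http://host:port/webhdfs/v1/path/to/myfile'
response = requests.get(url, params={'op': 'OPEN'}, stream=True)
for line in response.iter_lines():
    pass  # do some processing on each line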

Hope this helps!

Upvotes: 2

Brian Dolan

Reputation: 3136

In the last two years, there has been a lot of movement around Hadoop Streaming. It is pretty fast according to Cloudera: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/ and I've had good success with it.
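
For reference, the core of a Hadoop Streaming job is just a script that reads lines from stdin. A minimal mapper sketch (the job itself is launched with the hadoop-streaming jar, pointing it at this script and the input/output paths):

#!/usr/bin/env python
# mapper.py - Hadoop Streaming feeds each input line to this script on stdin
import sys

for line in sys.stdin:
    # do some processing on the line, then emit output for the next stage
    sys.stdout.write(line)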

Upvotes: 1

simleo

Reputation: 2965

If you want to avoid adding external dependencies at any cost, Keith's answer is the way to go. Pydoop, on the other hand, could make your life much easier:

import pydoop.hdfs as hdfs
with hdfs.open('/user/myuser/filename') as f:
    for line in f:
        do_something(line)

Regarding your concerns, Pydoop is actively developed and has been used in production for years at CRS4, mostly for computational biology applications.

Simone

Upvotes: 33

Keith Randall

Reputation: 23265

You want xreadlines; it reads lines from a file without loading the whole file into memory.

Edit:

Now that I see your question, you just need to get the stdout pipe from your Popen object:

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
for line in cat.stdout:
    print line
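
If you also want to know whether the hadoop command itself succeeded, you can close the pipe and check the return code once the stream is exhausted; a small addition to the snippet above:

import subprocess

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)
for line in cat.stdout:
    print line
cat.stdout.close()
# wait() returns the exit status of "hadoop fs -cat"; non-zero means the read failed
if cat.wait() != 0:
    raise IOError("hadoop fs -cat failed")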

Upvotes: 47
