Reputation: 323
I have a Hadoop cluster running on CentOS 6.5. I am currently using Python 2.6, and for unrelated reasons I can't upgrade to Python 2.7, which unfortunately means I cannot install Pydoop. Inside the Hadoop cluster I have a large number of raw data files named raw(yearmonthdaytimehour).txt, where everything in parentheses is a number. Is there a way to make a list of all the files in a Hadoop directory in Python? The program would then create a list that looks something like this:
listoffiles = ['raw160317220001.txt', 'raw160317230001.txt', ...]
That would make everything I need to do a lot easier, since to get the file from day 2, hour 15, I would just need to call dothing(listoffiles[39]). There are unrelated complications to why I have to do it this way.
I know there is an easy way to do this with local directories, but Hadoop makes everything a little more complicated.
Upvotes: 1
Views: 2791
Reputation: 474
I would recommend this Python project: https://github.com/mtth/hdfs. It uses HttpFS and is actually quite simple and fast. I've been using it on my cluster with Kerberos and it works like a charm. You just need to set the NameNode or HttpFS service URL.
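As a minimal sketch with that library's InsecureClient (the endpoint URL, port, user, and directory below are placeholders for your setup; for a Kerberized cluster there is also a KerberosClient in hdfs.ext.kerberos):

from hdfs import InsecureClient

# Point the client at the WebHDFS/HttpFS endpoint (host, port and user are placeholders).
client = InsecureClient('http://namenode.example.com:14000', user='hdfs')

# list() returns the bare file names in the directory; sorting them gives the
# date/hour ordering the question asks for.
listoffiles = sorted(client.list('/path/to/raw'))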
Upvotes: 1
Reputation: 1464
I would suggest checking out hdfs3:
>>> from hdfs3 import HDFileSystem
>>> hdfs = HDFileSystem(host='localhost', port=8020)
>>> hdfs.ls('/user/data')
>>> hdfs.put('local-file.txt', '/user/data/remote-file.txt')
>>> hdfs.cp('/user/data/file.txt', '/user2/data')
Like Snakebite, hdfs3 uses protobufs for communication and bypasses the JVM. Unlike Snakebite, hdfs3 offers Kerberos support.
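To build the indexable list from the question, something along these lines should work (an untested sketch: host, port, and directory are placeholders, and depending on the hdfs3 version ls() may return plain paths or detail dicts, so both cases are handled):

import posixpath
from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host='localhost', port=8020)

# Entries may be full path strings or info dicts with a 'name' key.
entries = hdfs.ls('/path/to/raw')
names = [e['name'] if isinstance(e, dict) else e for e in entries]

# Keep only the raw*.txt files and sort so the index follows date/hour order.
listoffiles = sorted(posixpath.basename(n) for n in names
                     if posixpath.basename(n).startswith('raw'))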
Upvotes: 1
Reputation: 34704
If Pydoop doesn't work, you can try the Snakebite library, which should work with Python 2.6. Another option is enabling the WebHDFS API and using it directly with requests or something similar:
import requests
print requests.get("http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS").json()
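LISTSTATUS returns JSON of the form {"FileStatuses": {"FileStatus": [...]}}, where each entry's pathSuffix is the bare file name, so building the sorted list from the question would look roughly like this (host, port, and path are placeholders as above):

import requests

statuses = requests.get(
    "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"
).json()["FileStatuses"]["FileStatus"]

# pathSuffix is the bare file name of each entry.
listoffiles = sorted(s["pathSuffix"] for s in statuses)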
With Snakebite:
from snakebite.client import Client
client = Client("localhost", 8020, use_trash=False)
for x in client.ls(['/']):
    print x
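Snakebite's ls() yields a dict per entry (the 'path' key holds the full HDFS path, if I remember the field names correctly), so the sorted file-name list could be built roughly like this, with the directory as a placeholder:

import posixpath
from snakebite.client import Client

client = Client("localhost", 8020, use_trash=False)

# Each entry is a dict describing one file; take the base name of its path.
listoffiles = sorted(posixpath.basename(x['path'])
                     for x in client.ls(['/path/to/raw']))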
Upvotes: 1