Reputation: 712
I am writing Python code to run on a Hadoop cluster and need to store some intermediate data in a file. Since I want to run the code on the cluster, I want to write the intermediate data to the /tmp directory on HDFS. I will delete the file immediately after I use it for the next steps. How can I do that?
I know I can use subprocess.call(), but how do I write the data to the file? The data I want to write is in a list.
I tried the following syntax:
for item in mylist:
    subprocess.call(["echo '%s' | hadoop fs -put - /tmp/t" % item], shell=True)
It writes fine, but there is a problem: from the second record onwards, it throws an error saying /tmp/t already exists.
Is there a way I can do this?
Upvotes: 1
Views: 2959
Reputation: 96
Just change '-put' to '-appendToFile':
for item in mylist:
    subprocess.call(["echo '%s' | hadoop fs -appendToFile - /tmp/t" % item], shell=True)
Upvotes: 1
Reputation: 510
You're facing that error because HDFS cannot append to files when writing from the shell; you need to create a new file each time you want to dump data.
A better way to do this would be to use a Python HDFS client to do the dumping for you. I can recommend the snakebite, pydoop, and hdfs packages. I haven't tried the third one, so I can't comment on it, but the other two work fine.
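For example, with pydoop the whole list can be written in one pass and the file removed once the next step has consumed it. This is only a sketch; it assumes pydoop is installed on the node and can locate your Hadoop configuration, and it reuses the /tmp/t path from the question:

import pydoop.hdfs as hdfs

# Sketch: write every item to /tmp/t in one pass (assumes pydoop can
# locate HADOOP_HOME / the cluster configuration).
with hdfs.open("/tmp/t", "wt") as f:
    for item in mylist:
        f.write("%s\n" % item)

# ... run the next step against /tmp/t ...

hdfs.rmr("/tmp/t")  # remove the intermediate file when done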
Upvotes: 0