kskp

Reputation: 712

Write a list into a file on HDFS

I am writing Python code to run on a Hadoop cluster and need to store some intermediate data in a file. Since I want to run the code on the cluster, I want to write the intermediate data to the /tmp directory on HDFS. I will delete the file immediately after I have used it for the next steps. How can I do that?

I know I can use subprocess.call() but how do I write the data into the file? The data I want to write is inside a list.

I tried the following syntax:

import subprocess

for item in mylist:
    subprocess.call(["echo '%s' | hadoop fs -put - /tmp/t"%item], shell=True)

The first record is written fine, but there is a problem: from the second record onwards it throws an error saying /tmp/t already exists.

Is there a way I can do this?

Upvotes: 1

Views: 2959

Answers (2)

David Wang

Reputation: 96

Just change '-put' to '-appendToFile':

for item in mylist:
    subprocess.call(["echo '%s' | hadoop fs -appendToFile - /tmp/t"%item], shell=True)
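If starting one hadoop process per list item turns out to be slow, the same data can also be streamed through stdin in a single call; because the file is then written only once, a single -put is enough. This is only a sketch, assuming Python 3.5+ and that mylist holds strings:

import subprocess

# Join the list into one newline-separated block and write it to HDFS
# with a single "hadoop fs -put"; "-" means "read the data from stdin".
data = "\n".join(mylist) + "\n"
subprocess.run(["hadoop", "fs", "-put", "-", "/tmp/t"],
               input=data.encode("utf-8"), check=True)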

Upvotes: 1

ar7

Reputation: 510

You're facing that error because hadoop fs -put refuses to write to a file that already exists, so with this approach you would need to create a new file each time you want to dump data.

A better way to do this would be to use a Python HDFS client to do the dumping for you. I can recommend the snakebite, pydoop, and hdfs packages. I haven't tried the third one, so I cannot comment on it, but the other two work fine.
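For example, with pydoop the whole list can be dumped in one call. This is just a minimal sketch, assuming pydoop is installed and picks up the cluster's HDFS configuration, and that mylist contains strings:

import pydoop.hdfs as hdfs

# Join the list into newline-separated text and write it to /tmp/t in one shot.
hdfs.dump("\n".join(mylist) + "\n", "/tmp/t")

# Clean up once the intermediate data has been consumed.
hdfs.rmr("/tmp/t")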

Upvotes: 0
