nishant

Reputation: 925

Python write to hdfs file

What is the best way to create/write/update a file in remote HDFS from local python script?

I am able to list files and directories but writing seems to be a problem.

I have searched hdfs and snakebite but none of them give a clean way to do this.

Upvotes: 13

Views: 22771

Answers (3)

Andy_101

Reputation: 1306

Try the HDFS library; it's really good. You can use write(): https://hdfscli.readthedocs.io/en/latest/api.html#hdfs.client.Client.write

Example:

To create a connection:

from hdfs import InsecureClient
client = InsecureClient('http://host:port', user='ann')

from json import dump, dumps
records = [
  {'name': 'foo', 'weight': 1},
  {'name': 'bar', 'weight': 2},
]

# As a context manager:
with client.write('data/records.jsonl', encoding='utf-8') as writer:
  dump(records, writer)

# Or, passing in a generator directly:
client.write('data/records.jsonl', data=dumps(records), encoding='utf-8')
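
The question also mentions updating an existing file. Client.write() takes overwrite and append flags, so a small sketch (assuming the cluster allows appends) would be:

# Append to the existing file instead of overwriting it
with client.write('data/records.jsonl', encoding='utf-8', append=True) as writer:
  dump([{'name': 'baz', 'weight': 3}], writer)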

For CSV you can do:

import pandas as pd
df = pd.read_csv("file.csv")
with client.write('path/output.csv', encoding='utf-8') as writer:
  df.to_csv(writer)

Upvotes: 3

OneCricketeer

Reputation: 191993

What's wrong with other answers

They use WebHDFS, which is not enabled by default and is insecure without Kerberos or Apache Knox.

This is what the upload function of that hdfs library you linked to uses.

Native (more secure) ways to write to HDFS using Python

You can use pyspark.

Example - How to write pyspark dataframe to HDFS and then how to read it back into dataframe?
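
For illustration, a minimal pyspark sketch (the HDFS URI, path, and column names here are placeholders, not from the question):

# A minimal sketch, assuming Spark is already configured to talk to your cluster;
# the hdfs:// URI and path below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-to-hdfs").getOrCreate()

df = spark.createDataFrame(
    [("foo", 1), ("bar", 2)],
    ["name", "weight"],
)

# Write the dataframe to HDFS as Parquet (CSV and JSON writers exist too)
df.write.mode("overwrite").parquet("hdfs://namenode:8020/data/records")

# Read it back to verify
spark.read.parquet("hdfs://namenode:8020/data/records").show()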


snakebite has been mentioned, but it doesn't write files


pyarrow has a FileSystem.open() function that should be able to write to HDFS as well, though I've not tried it.
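
A rough, untested sketch of that with pyarrow's filesystem interface (it assumes libhdfs and the Hadoop client libraries are available on the machine; host, port, user, and path are placeholders):

# Rough sketch; requires libhdfs plus HADOOP_HOME/CLASSPATH to be set up.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode", port=8020, user="ann")

# Stream bytes into a new HDFS file
with hdfs.open_output_stream("/data/records.jsonl") as f:
    f.write(b'{"name": "foo", "weight": 1}\n')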

Upvotes: 1

Ramsha Siddiqui

Reputation: 480

Without pulling in a library built specifically for HDFS, you can also simply use the requests package in Python to talk to WebHDFS:

import requests
from json import dumps
params = {'op': 'CREATE'}
data = dumps(file)  # some file or object - also tested for pickle library
response = requests.put('http://host:port/path', params=params, data=data)

If the response is 200, your connection is working! This technique lets you use all the utilities exposed by Hadoop's RESTful API: ls, mkdir, get, put, etc.
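
Note that, per the WebHDFS docs linked in step 1 below, CREATE is a two-step operation: the first PUT to the namenode returns a 307 redirect whose Location header points at a datanode, and the file contents are sent in a second PUT to that address. A rough sketch (host, port, and path are placeholders):

import requests
from json import dumps

# Placeholders; substitute your namenode host/port and target file path
url = 'http://namenode:50070/webhdfs/v1/user/ann/records.json'

# Step 1: ask the namenode where to write; send no data and don't follow the redirect
resp = requests.put(url, params={'op': 'CREATE'}, allow_redirects=False)
datanode_url = resp.headers['Location']

# Step 2: send the actual file contents to the datanode address we were given
resp = requests.put(datanode_url, data=dumps({'name': 'foo', 'weight': 1}))
print(resp.status_code)  # 201 Created on success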

You can also convert curl commands to Python:

  1. Get the command for HDFS: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html
  2. Convert it to Python: https://curl.trillworks.com/

Hope this helps!

Upvotes: 0
