Reputation: 2888
I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect().
I also know I can read a parquet file using pyarrow.parquet's read_table().
However, read_table() accepts a filepath, whereas hdfs.connect() gives me a HadoopFileSystem instance.
Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get a hold of a parquet file/folder residing in an HDFS cluster? What I wish to get to is the to_pydict() function, then I can pass the data along.
Upvotes: 4
Views: 14850
Reputation: 43
You can read and write with pyarrow as shown in the accepted answer. However, the APIs used there have long been deprecated and don't work with recent versions of Hadoop. Use the following instead:
from pyarrow import fs
import pyarrow.parquet as pq
# connect to hadoop
hdfs = fs.HadoopFileSystem('hostname', 8020)
# read a single file from hdfs
with hdfs.open_input_file(path) as pqt:
    df = pq.read_table(pqt).to_pandas()
# read a directory full of partitioned parquet files (e.g. written by spark)
df = pq.ParquetDataset(path, filesystem=hdfs).read().to_pandas()
Upvotes: 0
Reputation: 143
I tried the same with the Pydoop library and engine='pyarrow', and it worked perfectly for me. Here is the generalized method.
!pip install pydoop pyarrow
import logging
import pandas as pd
import pydoop.hdfs as hd

logger = logging.getLogger(__name__)

# read a parquet file via Pydoop and return it as a DataFrame
def readParquetFilesPydoop(path):
    with hd.open(path) as f:
        df = pd.read_parquet(f, engine='pyarrow')
    logger.info('file: ' + path + ' : ' + str(df.shape))
    return df
Upvotes: 1
Reputation: 105661
Try
import pyarrow as pa
fs = pa.hdfs.connect(...)
fs.read_parquet('/path/to/hdfs-file', **other_options)
or
import pyarrow.parquet as pq
with fs.open(path) as f:
    pq.read_table(f, **read_options)
I opened https://issues.apache.org/jira/browse/ARROW-1848 about adding more explicit documentation for this.
Upvotes: 7