Reputation: 133
I also configured the environment variables needed so that Python can read from HDFS:
export ARROW_LIBHDFS_DIR='/opt/hadoop/lib/native'
export HADOOP_COMMON_LIB_NATIVE_DIR='/opt/hadoop/lib/native'
export HADOOP_OPTS="-Djava.library.path=/opt/hadoop/lib/"
Running ls $ARROW_LIBHDFS_DIR gives:
libhadoop.a libhadooppipes.a libhdfs.so libnativetask.so
libhadoop.so libhadooputils.a libhdfs.so.0.0.0 libnativetask.so.1.0.0
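(For reference, the equivalent setup done from within Python, before pyarrow loads libhdfs, would be roughly the sketch below. The CLASSPATH line is an assumption on my part: pyarrow's HDFS bindings typically also need the Hadoop jars on the classpath, which hadoop classpath --glob produces.)
import os
import subprocess

# Same paths as the shell exports above
os.environ['ARROW_LIBHDFS_DIR'] = '/opt/hadoop/lib/native'
os.environ['HADOOP_COMMON_LIB_NATIVE_DIR'] = '/opt/hadoop/lib/native'
os.environ['HADOOP_OPTS'] = '-Djava.library.path=/opt/hadoop/lib/'

# Assumed extra step: pyarrow usually needs the Hadoop jars on CLASSPATH;
# `hadoop classpath --glob` expands them to a concrete jar list.
os.environ['CLASSPATH'] = subprocess.check_output(
    ['hadoop', 'classpath', '--glob']).decode().strip()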
My Python code:
import pandas as pd
pd.read_parquet('hdfs:///tmp/data/test.parquet', engine='pyarrow')
The error I get:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hdfsGetPathInfo(hdfs:///tmp/data/test.parquet): getFileInfo error:
ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to com.google.protobuf.Message
java.lang.ClassCastException: org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto cannot be cast to com.google.protobuf.Message
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:225)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:900)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1654)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
hdfsGetPathInfo(hdfs:///tmp/data/test.parquet): getFileInfo error:
IllegalStateException: java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
at org.apache.hadoop.ipc.Client.setCallIdAndRetryCount(Client.java:117)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:162)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1654)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1579)
at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1591)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1734)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parquet.py", line 296, in read_parquet
return impl.read(path, columns=columns, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/pandas/io/parquet.py", line 125, in read
path, columns=columns, **kwargs
File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1544, in read_table
partitioning=partitioning)
File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1173, in __init__
open_file_func=partial(_open_dataset_file, self._metadata)
File "/usr/local/lib/python3.5/dist-packages/pyarrow/parquet.py", line 1368, in _make_manifest
.format(path))
OSError: Passed non-file path: hdfs:///tmp/data/test.parquet
Upvotes: 2
Views: 1165
Reputation: 11
Try reading the file through pyarrow.parquet directly instead of pandas:
import pyarrow.parquet as pq

# Open the Parquet dataset on HDFS, read it into an Arrow table,
# then convert it to a pandas DataFrame
dataset = pq.ParquetDataset("hdfs:///tmp/data/test.parquet")
table = dataset.read()
df = table.to_pandas()
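Depending on your pyarrow version, it may also help to connect to HDFS explicitly and pass the filesystem handle in, rather than relying on the hdfs:// URI being parsed for you. A minimal sketch, assuming the NameNode is resolvable from core-site.xml (pyarrow.fs.HadoopFileSystem is available in newer pyarrow releases):
import pyarrow.parquet as pq
from pyarrow import fs

# 'default' resolves the NameNode from fs.defaultFS in core-site.xml;
# an explicit host/port pair also works
hdfs = fs.HadoopFileSystem('default')

# Pass the filesystem explicitly and use a plain path, not a URI
df = pq.read_table('/tmp/data/test.parquet', filesystem=hdfs).to_pandas()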
Upvotes: 1