Reputation: 41
I've been trying to install pyarrow via pip (pip install pyarrow, and, as suggested by Yagav: py -3.7 -m pip install --user pyarrow), via conda (conda install -c conda-forge pyarrow, and also conda install pyarrow), and by building the lib from source (using a conda environment and some magic which I don't really understand). Every time the installation finishes with no errors, but I end up with one and the same problem when I call:
import pyarrow as pa
fs = pa.hdfs.connect(host='my_host', user='my_user@my_host', kerb_ticket='path_to_kerb_ticket')
it fails with the following message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 209, in connect
    extra_conf=extra_conf)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
    _maybe_set_hadoop_classpath()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 135, in _maybe_set_hadoop_classpath
    classpath = _hadoop_classpath_glob(hadoop_bin)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 162, in _hadoop_classpath_glob
    return subprocess.check_output(hadoop_classpath_args)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 395, in check_output
    **kwargs).stdout
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 472, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1178, in _execute_child
    startupinfo)
OSError: [WinError 193] %1 is not a valid Win32 application
At first I thought there was a problem with libhdfs.so from Hadoop 2.5.6, but it seems I was wrong about that. I guess the problem is not in pyarrow or subprocess, but in some system variables or dependencies.
I have also manually defined the system variables HADOOP_HOME, JAVA_HOME and KRB5CCNAME.
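For reference, here is the quick sanity check of what the interpreter actually sees (a minimal sketch; the variable names are just the ones listed above):

import os

# Print the variables the pyarrow classpath lookup depends on;
# a missing or empty value here is a likely culprit.
for name in ('HADOOP_HOME', 'JAVA_HOME', 'KRB5CCNAME', 'CLASSPATH'):
    print(name, '=', os.environ.get(name, '<not set>'))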
Upvotes: 3
Views: 6127
Reputation: 565
This is the solution. The stock _maybe_set_hadoop_classpath in pyarrow/hdfs.py runs the extensionless hadoop launcher ('{HADOOP_HOME}/bin/hadoop'), which is a shell script and not a valid Win32 executable, hence the WinError 193. Pointing the lookup at hadoop.cmd instead makes it work:
import os
import subprocess


def _maybe_set_hadoop_classpath():
    # Nothing to do if CLASSPATH already points at a Hadoop install.
    if 'hadoop' in os.environ.get('CLASSPATH', ''):
        return

    if 'HADOOP_HOME' in os.environ:
        hadoop_bin = os.path.join(os.path.normpath(os.environ['HADOOP_HOME']), 'bin')
    else:
        hadoop_bin = ''  # fall back to whatever is on PATH

    # The stock pyarrow code runs the extensionless 'hadoop' script here;
    # on Windows the launcher is hadoop.cmd, hence the WinError 193.
    hadoop_bin_exe = os.path.join(hadoop_bin, 'hadoop.cmd')

    classpath = subprocess.check_output([hadoop_bin_exe, 'classpath', '--glob'])
    os.environ['CLASSPATH'] = classpath.decode('utf-8')
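A minimal usage sketch (reusing the placeholder host/user/ticket values from the question): call the patched helper once before connecting, so CLASSPATH is populated by the time pyarrow needs it.

import pyarrow as pa

_maybe_set_hadoop_classpath()  # populate CLASSPATH before connecting
fs = pa.hdfs.connect(host='my_host', user='my_user@my_host', kerb_ticket='path_to_kerb_ticket')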
Upvotes: 1
Reputation: 41
Ok, I found it by myself.
As I suspected, the problem was in the system environment variables: there needs to be a CLASSPATH variable containing the paths to all .jar files of the Hadoop client. You can get them by running hadoop classpath or hadoop classpath --glob in cmd.
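The same can be done from Python before connecting; a minimal sketch, assuming hadoop.cmd is reachable on PATH (the .cmd launcher is called explicitly, since the extensionless hadoop script is what raises WinError 193 on Windows):

import os
import subprocess

import pyarrow as pa

# Capture the glob-expanded Hadoop classpath and export it for libhdfs.
classpath = subprocess.check_output(['hadoop.cmd', 'classpath', '--glob'])
os.environ['CLASSPATH'] = classpath.decode('utf-8')

fs = pa.hdfs.connect(host='my_host', user='my_user@my_host', kerb_ticket='path_to_kerb_ticket')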
Upvotes: 1
Reputation: 300
You could run this command in cmd so that pyarrow installs correctly:
py -3.7 -m pip install --user pyarrow
After installing, try the code again.
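To confirm the install took before re-running the connect call, a trivial check (just a sketch):

import pyarrow as pa
print(pa.__version__)  # imports cleanly and prints the installed version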
Upvotes: 0