Eliot Leshchenko

Reputation: 41

How to properly setup pyarrow for python 3.7 on Windows

I've been trying to install pyarrow via pip (pip install pyarrow, and, as Yagav suggested, py -3.7 -m pip install --user pyarrow), via conda (conda install -c conda-forge pyarrow, also conda install pyarrow), and by building the lib from source (using a conda environment and some magic I don't really understand). Every time, the installation finishes with no errors, but it always ends with one and the same problem when I call:

import pyarrow as pa
fs = pa.hdfs.connect(host='my_host', user='my_user@my_host', kerb_ticket='path_to_kerb_ticket')

it fails with the following message:

Traceback (most recent call last):
  File "", line 1, in 
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 209, in connect
    extra_conf=extra_conf)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 37, in __init__
    _maybe_set_hadoop_classpath()
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 135, in _maybe_set_hadoop_classpath
    classpath = _hadoop_classpath_glob(hadoop_bin)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyarrow\hdfs.py", line 162, in _hadoop_classpath_glob
    return subprocess.check_output(hadoop_classpath_args)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 395, in check_output
    **kwargs).stdout
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 472, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 775, in __init__
    restore_signals, start_new_session)
  File "C:\ProgramData\Anaconda3\lib\subprocess.py", line 1178, in _execute_child
    startupinfo)
OSError: [WinError 193] %1 is not a valid win32 application

At first I thought there was a problem with libhdfs.so from Hadoop 2.5.6, but it seems I was wrong about that. I guess the problem is not in pyarrow or subprocess itself, but in some system variables or dependencies.

I have also manually defined the system variables HADOOP_HOME, JAVA_HOME and KRB5CCNAME.
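For reference, here is a quick sanity check of those variables from Python (I include CLASSPATH only because pyarrow's hdfs.py consults it, per the traceback above):

import os

# Print the environment variables that pyarrow's HDFS code relies on.
for var in ('HADOOP_HOME', 'JAVA_HOME', 'KRB5CCNAME', 'CLASSPATH'):
    print(var, '=', os.environ.get(var))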

Upvotes: 3

Views: 6127

Answers (3)

quantCode

Reputation: 565

This is the solution: replace _maybe_set_hadoop_classpath with a version that invokes hadoop.cmd, since on Windows the bare hadoop launcher is not a native executable (which is what triggers WinError 193):

import os
import subprocess


def _maybe_set_hadoop_classpath():
    # Nothing to do if a Hadoop classpath has already been set.
    if 'hadoop' in os.environ.get('CLASSPATH', ''):
        return

    if 'HADOOP_HOME' in os.environ:
        hadoop_bin = os.path.join(os.path.normpath(os.environ['HADOOP_HOME']), 'bin')
    else:
        hadoop_bin = 'hadoop'

    # On Windows the Hadoop launcher is hadoop.cmd, not a native executable;
    # invoking the bare 'hadoop' binary is what raises WinError 193.
    if os.path.isdir(hadoop_bin):
        os.chdir(hadoop_bin)
    hadoop_bin_exe = os.path.join(hadoop_bin, 'hadoop.cmd')
    classpath = subprocess.check_output([hadoop_bin_exe, 'classpath', '--glob'])
    os.environ['CLASSPATH'] = classpath.decode('utf-8')
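One way to apply it (my assumption, not something the pyarrow docs prescribe) is to monkey-patch the private helper before connecting; alternatively, edit the function directly in site-packages\pyarrow\hdfs.py:

# Hypothetical usage: override pyarrow's private helper with the patched
# version above before opening the HDFS connection.
import pyarrow.hdfs
pyarrow.hdfs._maybe_set_hadoop_classpath = _maybe_set_hadoop_classpath

import pyarrow as pa
fs = pa.hdfs.connect(host='my_host', user='my_user@my_host',
                     kerb_ticket='path_to_kerb_ticket')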

Upvotes: 1

Eliot Leshchenko

Reputation: 41

OK, I found it myself. As I suspected, the problem was in the system environment variables: there needs to be a CLASSPATH variable containing the paths to all the .jar files of the Hadoop client. You can get them by running hadoop classpath or hadoop classpath --glob in cmd.
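A minimal sketch of that fix (assuming hadoop.cmd is on PATH; host/user/ticket names are placeholders from my example above):

import os
import subprocess

# Build CLASSPATH from the Hadoop client's jar list before pyarrow needs it.
# 'hadoop.cmd classpath --glob' expands the wildcards into concrete .jar paths.
jars = subprocess.check_output(['hadoop.cmd', 'classpath', '--glob'])
os.environ['CLASSPATH'] = jars.decode('utf-8').strip()

import pyarrow as pa
fs = pa.hdfs.connect(host='my_host', user='my_user@my_host',
                     kerb_ticket='path_to_kerb_ticket')

Setting CLASSPATH once in the system environment variables works too; then no code changes are needed.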

Upvotes: 1

Yagav

Reputation: 300

You could run the following command in cmd to install pyarrow correctly:

py -3.7 -m pip install --user pyarrow

After installing, try your code again.

Upvotes: 0
