Reputation: 2022
Objective: To read a remote file stored in HDFS using pydoop, from my laptop. I'm using PyCharm Professional Edition and Cloudera CDH 5.4.
PyCharm configuration on my laptop: in the project interpreter (under Settings), I have pointed the Python interpreter to the remote server as ssh://remote-server-ip-address:port-number/home/ashish/anaconda/bin/python2.7
Now there is a file stored at the HDFS location /home/ashish/pencil/someFileName.txt
I then installed pydoop on the remote server with pip install pydoop, and it installed successfully. Then I wrote this code to read the file from the HDFS location:
import pydoop.hdfs as hdfs

with hdfs.open('/home/ashish/pencil/someFileName.txt') as f:
    for line in f:
        print(line, '\n')
On execution I get this error:
Traceback (most recent call last):
  File "/home/ashish/PyCharm_proj/Remote_Server_connect/hdfsConxn.py", line 7, in <module>
    import pydoop.hdfs as hdfs
  File "/home/ashish/anaconda/lib/python2.7/site-packages/pydoop/hdfs/__init__.py", line 82, in <module>
    from . import common, path
  File "/home/ashish/anaconda/lib/python2.7/site-packages/pydoop/hdfs/path.py", line 28, in <module>
    from . import common, fs as hdfs_fs
  File "/home/ashish/anaconda/lib/python2.7/site-packages/pydoop/hdfs/fs.py", line 34, in <module>
    from .core import core_hdfs_fs
  File "/home/ashish/anaconda/lib/python2.7/site-packages/pydoop/hdfs/core/__init__.py", line 49, in <module>
    _CORE_MODULE = init(backend=HDFS_CORE_IMPL)
  File "/home/ashish/anaconda/lib/python2.7/site-packages/pydoop/hdfs/core/__init__.py", line 29, in init
    jvm.load_jvm_lib()
  File "/home/ashish/anaconda/lib/python2.7/site-packages/pydoop/utils/jvm.py", line 33, in load_jvm_lib
    java_home = get_java_home()
  File "/home/ashish/anaconda/lib/python2.7/site-packages/pydoop/utils/jvm.py", line 28, in get_java_home
    raise RuntimeError("java home not found, try setting JAVA_HOME")
RuntimeError: java home not found, try setting JAVA_HOME

Process finished with exit code 1
My guess is that maybe it's not able to find py4j. The location of py4j is
/home/ashish/anaconda/lib/python2.7/site-packages/py4j
And when I echo JAVA_HOME on the remote server,
echo $JAVA_HOME
I get this location:
/usr/java/jdk1.7.0_67-cloudera
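One quick way to see whether the interpreter itself inherits that variable (a diagnostic sketch, not part of the original code; a remote interpreter launched over SSH by an IDE often does not get the login shell's environment):

```python
import os

# Print what the Python process actually sees; if this is None, pydoop's
# JAVA_HOME lookup will fail even though `echo $JAVA_HOME` works in a shell.
print(os.environ.get('JAVA_HOME'))
```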
I am new to programming in Python as well as to the CentOS setup, so please suggest what I can do to solve this problem.
Thanks
Upvotes: 0
Views: 1695
Reputation: 2022
Well, it looks like I solved it. What I did was use
sys.path.append('/usr/java/jdk1.7.0_67-cloudera')
I updated the code to:
import os, sys

sys.path.append('/usr/java/jdk1.7.0_67-cloudera')

input_file = '/home/ashish/pencil/someData.txt'
with open(input_file) as f:
    for line in f:
        print line
This code reads a file from HDFS on the remote server and prints the output in the PyCharm console on my laptop.
By using sys.path.append() you don't have to manually change hadoop-env.sh and risk conflicts with other Java configuration files.
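An alternative sketch (my assumption, based on the traceback's "try setting JAVA_HOME" message rather than pydoop's documentation): since pydoop looks up the JAVA_HOME environment variable at import time, setting it via os.environ before the import should also work, again without touching any Hadoop config files:

```python
import os

# Set JAVA_HOME for this process before importing pydoop; the path is the
# Cloudera JDK from this setup -- substitute your own installation directory.
os.environ['JAVA_HOME'] = '/usr/java/jdk1.7.0_67-cloudera'

# With the variable in place, the import that previously failed should succeed:
# import pydoop.hdfs as hdfs
```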
Upvotes: 1
Reputation: 2291
You can try setting JAVA_HOME in hadoop-env.sh (it's commented out by default).
Change:
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
To:
# The java implementation to use. Required.
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
Or whatever your Java installation directory is.
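To confirm that the directory you point JAVA_HOME at really contains a JDK before editing hadoop-env.sh, a small check can help (the helper name here is mine, not from Hadoop or pydoop):

```python
import os

def looks_like_jdk(path):
    """Return True if `path` has a java executable under bin/."""
    return os.path.isfile(os.path.join(path, 'bin', 'java'))

# Check the candidate path before writing it into hadoop-env.sh:
print(looks_like_jdk('/usr/java/jdk1.7.0_67-cloudera'))
```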
Upvotes: 0