Reputation: 2473
I'm trying to create a Windows 10 developer VM with a Conda environment and PySpark, but I keep hitting problems getting Spark and winutils to work.
Environment:
I have created C:\Hadoop\bin and downloaded winutils from here https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin (I've also tried 3.2.0).
HADOOP_HOME is C:\Hadoop and Path contains %HADOOP_HOME%\bin. JAVA_HOME is correct.
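For reference, a quick sanity check of the environment from Python looks like this (a minimal sketch based on the paths above):

import os

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME:", hadoop_home)                      # expect C:\Hadoop
print("JAVA_HOME:  ", os.environ.get("JAVA_HOME"))

# PATH entries are stored expanded, so look for the literal ...\Hadoop\bin
bin_dir = os.path.join(hadoop_home or "", "bin")
on_path = any(os.path.normcase(p.rstrip("\\")) == os.path.normcase(bin_dir)
              for p in os.environ.get("PATH", "").split(os.pathsep))
print("bin on PATH:", on_path)
print("winutils.exe present:", os.path.isfile(os.path.join(bin_dir, "winutils.exe")))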
This code works:
location = 'C:/myfiles/file.csv'
df = spark.read.format("csv").options(header=True).load(location)
This code fails:
location = 'C:/myfiles/'
df = spark.read.format("csv").options(header=True).load(location)
Error message:
An error occurred while calling o35.load.
: java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
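For completeness, both snippets above assume a local SparkSession created along these lines (a minimal sketch; the app name is just a placeholder):

from pyspark.sql import SparkSession

# Plain local-mode session, nothing Hadoop-specific configured
spark = (SparkSession.builder
         .master("local[*]")
         .appName("winutils-test")
         .getOrCreate())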
Winutils is being picked up, because if I delete it the first example above also breaks in the expected way.
It seems as if winutils is incompatible with Spark 3.1.1, and specifically with loading a folder of files? I find that hard to believe.
Bizarrely, though, I have another machine with PySpark 3.1.1, this version of winutils, and the same Java version, and it works! I'm lost - I've even copied the winutils files from the working machine to this one and it still didn't work.
Can anyone guide me on what that error means at least to help me understand where the issue could be?
Upvotes: 1
Views: 4977
Reputation: 586
In my case, I was able to resolve the issue by adding hadoop.dll (from the same link as provided in the question above) to the same location as winutils. Just make sure there is no version mismatch.
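As a quick sanity check before restarting Spark, something along these lines confirms both native pieces are in place (a minimal sketch; the C:\Hadoop fallback is an assumption):

import os

# hadoop.dll backs NativeIO$Windows.access0, the call failing in the stack
# trace above; winutils.exe alone does not cover that code path.
hadoop_bin = os.path.join(os.environ.get("HADOOP_HOME", r"C:\Hadoop"), "bin")
for name in ("winutils.exe", "hadoop.dll"):
    path = os.path.join(hadoop_bin, name)
    print(path, "->", "found" if os.path.isfile(path) else "MISSING")

Also restart the Python/Spark process after adding the DLL, since native libraries are loaded when the JVM starts.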
Upvotes: 3