Collin Cunningham

Reputation: 749

How do I get Spark working on Windows 10 without Hadoop?

I am trying to get Spark running on Windows 10, but I keep hitting errors. I have researched this thoroughly but am still stuck; here is what I have done:

  1. Installed JDK 1.8. (works fine)
  2. Installed Anaconda3 (works fine)
  3. Unzipped Spark 2.3.1
  4. Downloaded winutils.exe from here and placed it in .\Hadoop\bin\ (apart from this one file, the Hadoop folder is empty; I was told I did not need a full Hadoop install)
  5. Set up environment variables as follows:
    1. User Variable : PATH = .\Continuum\anaconda3
    2. System Variable :
      • JAVA_HOME = .\Java\jdk1.8.0_161
      • HADOOP_HOME = .\Hadoop
      • PYSPARK_DRIVER_PYTHON = jupyter
      • PYSPARK_DRIVER_PYTHON_OPTS = notebook
      • Path = .\Java\jdk1.8.0_161\bin; .\Hadoop\bin; .\spark\bin; .\Hadoop\bin\
  6. Created the folder C:\tmp\hive and ran winutils.exe chmod -R 777 \tmp\hive (a quick sanity check of this setup is sketched below)
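
For completeness, here is a quick check, run inside the Jupyter session, to confirm the variables above are actually visible to Python (a minimal sketch; the paths follow my layout and may differ on another machine):

import os

# Print the variables Spark reads at startup
for var in ("JAVA_HOME", "HADOOP_HOME", "PYSPARK_DRIVER_PYTHON"):
    print(var, "=", os.environ.get(var))

# winutils.exe must sit directly under %HADOOP_HOME%\bin
winutils = os.path.join(os.environ.get("HADOOP_HOME", ""), "bin", "winutils.exe")
print("winutils.exe found:", os.path.isfile(winutils))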

When I run pyspark in the terminal, Jupyter starts fine and runs code. However, the following code keeps breaking. Based on my research I am fairly certain this has to do with my installation.

from datetime import datetime
from pyspark.sql.types import Row

# sc is the SparkContext that the pyspark shell provides
records = sc.parallelize([[1, "Alice", 50], [2, "Bob", 80]])

# Converting the RDD to a DataFrame is what triggers the error below
df = records.toDF()

Which results in the error:

Py4JJavaError                             Traceback (most recent call last)
~\spark\python\pyspark\sql\utils.py in deco(*a, **kw)
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:

~\spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:

Py4JJavaError: An error occurred while calling o24.applySchemaToPythonRDD.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-;
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
    at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
    at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
    at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:102)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.externalCatalog(HiveSessionStateBuilder.scala:39)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog$lzycompute(HiveSessionStateBuilder.scala:54)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.catalog(HiveSessionStateBuilder.scala:52)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.<init>(HiveSessionStateBuilder.scala:69)
    at org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:69)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anonfun$build$2.apply(BaseSessionStateBuilder.scala:293)
    at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:79)
    at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:79)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
    at org.apache.spark.sql.SparkSession.internalCreateDataFrame(SparkSession.scala:577)
    at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:752)
    at org.apache.spark.sql.SparkSession.applySchemaToPythonRDD(SparkSession.scala:737)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
    at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:114)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:385)
    at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
    at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:195)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
    at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:195)
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
    ... 30 more
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rw-rw-rw-
    at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
    at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
    ... 45 more


During handling of the above exception, another exception occurred:

AnalysisException                         Traceback (most recent call last)
<ipython-input-5-576476669b32> in <module>()
----> 1 df = records.toDF()
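
What puzzles me is that the error reports rw-rw-rw- on /tmp/hive even though I ran chmod 777 on it. One way to inspect the permissions Hive actually sees is to call winutils from the session (a sketch; the winutils path is just my layout):

import subprocess

# List \tmp\hive with the same tool Hadoop uses on Windows
subprocess.run([r"C:\Hadoop\bin\winutils.exe", "ls", r"\tmp\hive"])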

Upvotes: 2

Views: 3326

Answers (1)

Collin Cunningham

Reputation: 749

So I figured out what worked for me:

I deleted the Hadoop folder that held only winutils.exe and instead downloaded the full binaries from https://github.com/steveloughran/winutils. I then pointed my HADOOP_HOME environment variable at the hadoop-2.7.1 folder, added the bin directory inside it to PATH, and restarted.
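
After restarting, a minimal smoke test along these lines should confirm the fix (a sketch; the folder name assumes the hadoop-2.7.1 layout from that repository):

import os

# Should now point at the full hadoop-2.7.1 folder, not the near-empty one
print(os.environ.get("HADOOP_HOME"))

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("smoke-test").getOrCreate()
sc = spark.sparkContext

# The same conversion that raised the AnalysisException before
records = sc.parallelize([[1, "Alice", 50], [2, "Bob", 80]])
df = records.toDF()
df.show()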

Upvotes: 2
