Reputation: 3
I am trying to read a JSON file in Spark and write it back as Parquet. I am running my code on Windows. Below is my code. After execution it creates a folder called output_spark.parquet, and it also throws an error that a file is not found. If I create the folder first and then run the code, it says the file already exists. Here is the error I get.
py4j.protocol.Py4JJavaError: An error occurred while calling o34.parquet. : java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset
Do I need a file writer to write the Parquet data to the file? I'd appreciate any code snippets you might have.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
df = spark.read.json("Output.json")
df.show()
df.write.parquet("output_spark.parquet")
Upvotes: 0
Views: 1340
Reputation: 640
On Windows, Hadoop requires native code extensions so that it can integrate with the OS correctly for things like file access semantics and permissions. How to fix this:
Get the winutils.exe binary from a Hadoop redistribution.
Set the environment variable %HADOOP_HOME% to point to the directory above the bin directory containing winutils.exe. To set it, search for "Edit the system environment variables" in the Windows search bar, then add the variable there.
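Alternatively, you can set the variables from Python before creating the SparkSession, which avoids having to restart your shell after editing the system settings. This is a minimal sketch; `C:\hadoop` is a hypothetical install path, so replace it with wherever you actually unpacked winutils.exe (the path must be the parent of the bin folder):

```python
import os

# Hypothetical location of the Hadoop redistribution on disk.
# HADOOP_HOME must point at the directory ABOVE bin\, not at bin\ itself.
hadoop_home = r"C:\hadoop"

# Tell Hadoop's native-library loader where to look.
os.environ["HADOOP_HOME"] = hadoop_home

# Prepend bin\ (which holds winutils.exe) to PATH so the DLLs resolve.
os.environ["PATH"] = hadoop_home + r"\bin" + os.pathsep + os.environ.get("PATH", "")

# Now build the SparkSession as in the question; the FileNotFoundException
# about HADOOP_HOME should no longer be raised.
```

Note that these `os.environ` assignments must run before the SparkSession is created, because the JVM reads the environment only once at startup.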
Upvotes: 1