Venkat Y

Reputation: 3

Writing a dataframe in Parquet

I am trying to read a JSON file in Spark and write it back out as Parquet. I am running my code on Windows. Below is my code. After execution it creates a folder called output_spark.parquet, but it also throws an error that the file is not found. If I create the file first and then run the code, it says that the file already exists. Here is the error I get.

py4j.protocol.Py4JJavaError: An error occurred while calling o34.parquet. : java.lang.RuntimeException: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset

Do I need a file writer to write the Parquet data to the file? I'd appreciate any code snippets you might have.

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    df = spark.read.json("Output.json")
    df.show()

    df.write.parquet("output_spark.parquet")

Upvotes: 0

Views: 1340

Answers (1)

Nassereddine BELGHITH

Reputation: 640

On Windows, Hadoop requires native code extensions so that it can integrate with the OS correctly for things like file access semantics and permissions. How to fix this:

  1. Get the `winutils.exe` binary from a Hadoop redistribution (use this link).

  2. Set the environment variable `%HADOOP_HOME%` to point to the directory above the `bin` directory that contains `winutils.exe`. On Windows, search for "Edit user variables" in the search bar, then add it as a user variable.
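If you prefer not to touch the system settings, you can also set the variable from Python before creating the `SparkSession`. A minimal sketch, assuming Hadoop was unpacked to `C:\hadoop` (a hypothetical path; the required layout is `%HADOOP_HOME%\bin\winutils.exe`):

```python
import os

# Hypothetical install location -- adjust to wherever you unpacked Hadoop.
hadoop_home = r"C:\hadoop"

# Spark reads HADOOP_HOME when it initialises the Hadoop filesystem layer,
# so these must be set before SparkSession.builder.getOrCreate() runs.
os.environ["HADOOP_HOME"] = hadoop_home
os.environ["PATH"] = (
    os.path.join(hadoop_home, "bin") + os.pathsep + os.environ.get("PATH", "")
)

# Now build the SparkSession exactly as in the question and
# df.write.parquet(...) should find winutils.exe.
```

Setting it this way only affects the current Python process, which is handy for quick experiments; the user-variable approach is better for anything permanent.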

Upvotes: 1
