EddyTheB

Reputation: 3210

Where is spark/pyspark saving my parquet files?

I'm saving a dataframe in pyspark to a particular location, but I cannot see the file/files in that directory. Where are they? How do I get to them outside of pyspark? How do I delete them? And what am I missing about how Spark works?

Here's how I save them...

df.write.format('parquet').mode('overwrite').save('path/to/filename')

And subsequently the following works...

df_ntf = spark.read.format('parquet').load('path/to/filename')

But no files ever appear in path/to/filename.
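(For anyone hitting the same thing, here is a sketch of a check one could run to print where Spark actually resolves that relative path. It goes through pyspark's internal _jvm/_jsc gateway to Hadoop's FileSystem API; those attributes are internals, not a public API, so treat this as illustrative.)

# Sketch: print the fully qualified form of the relative save path.
# spark._jvm and spark._jsc are pyspark internals, not a public API.
hadoop_conf = spark._jsc.hadoopConfiguration()
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
path = spark._jvm.org.apache.hadoop.fs.Path('path/to/filename')
print(fs.makeQualified(path))  # e.g. hdfs://<namenode>/user/<you>/path/to/filename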

This is on a Cloudera cluster; let me know if any other details are needed to diagnose the problem.

EDIT - This is the code I use to set up my Spark contexts.

import os
from datetime import datetime

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SparkSession, SQLContext

# Point Spark at the Cloudera parcel installs of Spark 2 and Python 3.
os.environ['SPARK_HOME'] = "/opt/cloudera/parcels/Anaconda/../SPARK2/lib/spark2/"
os.environ['PYSPARK_PYTHON'] = "/opt/cloudera/parcels/Anaconda/envs/python3/bin/python"

conf = SparkConf()
conf.setAll([('spark.executor.memory', '3g'),
             ('spark.executor.cores', '3'),
             ('spark.num.executors', '29'),
             ('spark.cores.max', '4'),
             ('spark.driver.memory', '2g'),
             ('spark.pyspark.python', '/opt/cloudera/parcels/Anaconda/envs/python3/bin/python'),
             ('spark.dynamicAllocation.enabled', 'false'),
             ('spark.sql.execution.arrow.enabled', 'true'),
             ('spark.sql.crossJoin.enabled', 'true')
             ])

print("Creating Spark Context at {}".format(datetime.now()))

spark_ctx = SparkContext.getOrCreate(conf=conf)

spark = SparkSession(spark_ctx)
hive_ctx = HiveContext(spark_ctx)
sql_ctx = SQLContext(spark_ctx)

Upvotes: 0

Views: 2142

Answers (1)

EddyTheB

Reputation: 3210

Ok, a colleague and I have figured it out. It's not complicated, but we are but simple data scientists, so it wasn't obvious to us.

Basically, the files were being saved in HDFS, under our user's home directory, not on the local drive from which we run our queries in Jupyter notebooks.

We found them by running:

hdfs dfs -ls -h /user/my.name/path/to
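And to answer the deletion part of my own question: once you know the files live in HDFS, hdfs dfs -rm -r /user/my.name/path/to/filename removes the directory, or it can be done from pyspark itself. A minimal sketch of the latter, assuming the same spark session as in the question (again via pyspark's internal JVM gateway to Hadoop's FileSystem API, which is not a public interface):

# Sketch: delete the saved parquet directory via Hadoop's FileSystem API.
# The path mirrors the one found above; adjust it to your own.
jvm = spark._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
fs.delete(jvm.org.apache.hadoop.fs.Path('/user/my.name/path/to/filename'), True)  # True = recursive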

Upvotes: 2
