huy

Reputation: 1914

How to save file to HDFS with Pyspark deploy-mode cluster?

I am using 3 computers to run a PySpark job: one computer is the master node, and computers A and B are the slave nodes.

I have a list like this that I want to save as a text file on HDFS:

data = [['1', '2'], ['3']]

This is how I save the list as a text file on HDFS:

def save_file_hdfs(data, session, path):
    """
    Use this to save the file to HDFS.
    The saved file will be named "part-00000"
    """
    # First need to convert the list to parallel RDD
    rdd_list = session.sparkContext.parallelize(data)

    # Convert each element to its string form (one element per line) and
    # coalesce to a single partition so everything lands in one output file
    rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(path)

Main code:

import pyspark
from pyspark.sql import SparkSession

output_path = "hdfs:///user/output"

spark = SparkSession.builder.appName("build_graph").getOrCreate()

save_file_hdfs(data, spark, output_path)

Run command:

spark-submit \
--master yarn \
--deploy-mode cluster \
example.py

Then I got this error in the stdout file of computer A; computer B runs normally:

Traceback (most recent call last):
  File "connected.py", line 158, in <module>
    save_file_hdfs(data, spark, output_path)
  File "example.py", line 135, in save_file_hdfs
    rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(path)
  File "/tmp/hadoop-tnhan/nm-local-dir/usercache/tnhan/appcache/application_1621226092496_0024/container_1621226092496_0024_02_000001/pyspark.zip/pyspark/rdd.py", line 1656, in saveAsTextFile
  File "/tmp/hadoop-tnhan/nm-local-dir/usercache/tnhan/appcache/application_1621226092496_0024/container_1621226092496_0024_02_000001/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/tmp/hadoop-tnhan/nm-local-dir/usercache/tnhan/appcache/application_1621226092496_0024/container_1621226092496_0024_02_000001/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
  File "/tmp/hadoop-tnhan/nm-local-dir/usercache/tnhan/appcache/application_1621226092496_0024/container_1621226092496_0024_02_000001/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o111.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://tnhanComputer:9000/user/output already exists

If I run it without `--master yarn --deploy-mode cluster`, it runs normally.

Do you know why I get this error?

Upvotes: 0

Views: 639

Answers (1)

pltc

Reputation: 6082

The error is pretty clear: the output path already exists, as stated in `FileAlreadyExistsException: Output directory hdfs://tnhanComputer:9000/user/output already exists`. Note that in cluster mode YARN can retry a failed application (your container name shows attempt `_02_`), so a second attempt can hit output that the first attempt already created, which is why the same script may appear to work outside cluster mode.
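Two common workarounds (sketches, not tested against your cluster): delete the directory before re-running, e.g. `hdfs dfs -rm -r -f hdfs:///user/output`, or write each run to a fresh path. A minimal helper for the second option, assuming the base path from your question:

```python
import time

def unique_output_path(base):
    """Append a Unix timestamp so each run writes to a fresh HDFS directory."""
    return f"{base}_{int(time.time())}"

# e.g. save_file_hdfs(data, spark, unique_output_path("hdfs:///user/output"))
```

This avoids the exception entirely at the cost of accumulating output directories, which you may want to clean up periodically.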

Upvotes: 1
