Reputation: 1914
I am using 3 computers to run a PySpark job: one computer is the master node, and computers A and B are the worker nodes.
I have a list like this and I want to save this as a text file on the HDFS:
data = [['1', '2'], ['3']]
This is how I save the list as a text file on the HDFS:
def save_file_hdfs(data, session, path):
    """
    Save the list to HDFS as a text file.
    The saved file will be named "part-00000".
    """
    # First convert the list to an RDD
    rdd_list = session.sparkContext.parallelize(data)
    # Stringify each element (one per line) and coalesce everything into a single output file
    rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(path)
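As a side note, `map(lambda row: str(row))` writes the Python `repr` of each sub-list as one line of the output file, not a delimited format. A quick local check of what each line will look like:

```python
# Each element of the list is stringified exactly as str() renders it,
# so the lines in part-00000 are Python list reprs.
data = [['1', '2'], ['3']]
lines = [str(row) for row in data]
for line in lines:
    print(line)
```

If you later want to parse the file back, something like a tab- or comma-joined string per row may be easier to work with than a repr.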
Main code:
from pyspark.sql import SparkSession

data = [['1', '2'], ['3']]
output_path = "hdfs:///user/output"
spark = SparkSession.builder.appName("build_graph").getOrCreate()
save_file_hdfs(data, spark, output_path)
Run command:
spark-submit \
--master yarn \
--deploy-mode cluster \
example.py
Then I got this error in the stdout file on computer A; computer B runs normally:
Traceback (most recent call last):
File "connected.py", line 158, in <module>
save_file_hdfs(data, spark, output_path)
File "example.py", line 135, in save_file_hdfs
rdd_list.coalesce(1).map(lambda row: str(row)).saveAsTextFile(path)
File "/tmp/hadoop-tnhan/nm-local-dir/usercache/tnhan/appcache/application_1621226092496_0024/container_1621226092496_0024_02_000001/pyspark.zip/pyspark/rdd.py", line 1656, in saveAsTextFile
File "/tmp/hadoop-tnhan/nm-local-dir/usercache/tnhan/appcache/application_1621226092496_0024/container_1621226092496_0024_02_000001/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
File "/tmp/hadoop-tnhan/nm-local-dir/usercache/tnhan/appcache/application_1621226092496_0024/container_1621226092496_0024_02_000001/pyspark.zip/pyspark/sql/utils.py", line 128, in deco
File "/tmp/hadoop-tnhan/nm-local-dir/usercache/tnhan/appcache/application_1621226092496_0024/container_1621226092496_0024_02_000001/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o111.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://tnhanComputer:9000/user/output already exists
If I run it without --master yarn --deploy-mode cluster, it runs normally.
Do you know why I get this error?
Upvotes: 0
Views: 639
Reputation: 6082
The error is pretty clear: the path already exists. FileAlreadyExistsException: Output directory hdfs://tnhanComputer:9000/user/output already exists
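One common way to handle this, assuming you are happy to discard any previous output, is to delete the path through Hadoop's FileSystem API before saving. A sketch (`remove_if_exists` is a hypothetical helper name; it goes through Spark's JVM gateway via the private `_jvm`/`_jsc` attributes, which work in practice but are not public API):

```python
def remove_if_exists(spark, path_str):
    """Recursively delete an HDFS path if it already exists.

    `spark` is an active SparkSession; uses the Hadoop FileSystem
    API through Spark's JVM gateway. Returns True if a path was deleted.
    """
    jvm = spark._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
    path = jvm.org.apache.hadoop.fs.Path(path_str)
    if fs.exists(path):
        fs.delete(path, True)  # True = delete recursively
        return True
    return False
```

Call it before saveAsTextFile, e.g. `remove_if_exists(spark, output_path)`. Alternatively, if you write through the DataFrame API instead of an RDD, you can use `.write.mode("overwrite")`, which handles the existing directory for you.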
Upvotes: 1