Reputation: 650
I am trying to read all the JSON files from one directory and store them in a Spark DataFrame using the code below (it works fine):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("hdfs:///user/temp/backup_data/st_in_*/*/*.json", multiLine=True)
but when I try to save the DataFrame as multiple files using the code below
df.write.json("hdfs:///user/another_dir/to_save_dir/")
It doesn't store the files as expected and throws an error like
to_save_dir already exists
I just want to save the files to the destination dir just as I read them from the source dir.
Edit:
The problem is that when I read multiple files and want to write them to a directory, what is the procedure in PySpark? The reason I am asking is that once Spark loads all the files, it creates a single DataFrame, and each file is a row in this DataFrame. How should I proceed to create a new file for each of the rows in the DataFrame?
Upvotes: 0
Views: 4853
Reputation: 32720
The error you get is quite clear: it seems the location you're trying to write to already exists. You can choose to overwrite it by specifying the mode:
df.write.mode("overwrite").json("hdfs:///user/another_dir/to_save_dir/")
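Equivalently, the mode can be passed directly as a parameter of json(); this is a minor variation using the standard PySpark DataFrameWriter API:

# same effect as the call above: overwrite the target directory if it exists
df.write.json("hdfs:///user/another_dir/to_save_dir/", mode="overwrite")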
However, if your intent is only to move files from one location to another in HDFS, you don't need to read the files into Spark and then write them out. Instead, try using the Hadoop FileSystem API:
# access the Hadoop FileSystem classes through the JVM gateway (sc is the SparkContext)
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileUtil = sc._gateway.jvm.org.apache.hadoop.fs.FileUtil

src_path = Path(src_folder)
dest_path = Path(dest_folder)

# copy src_folder to dest_folder; the boolean argument deletes the source
# after copying, so this effectively moves the files
FileUtil.copy(src_path.getFileSystem(conf),
              src_path,
              dest_path.getFileSystem(conf),
              dest_path,
              True,
              conf)
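
For reference, a minimal sketch of how the snippet above could be wired up end to end; the SparkSession setup and the src_folder/dest_folder values are assumptions for illustration, not part of the original answer:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # the snippet above refers to this SparkContext as sc

# hypothetical example paths; replace with your own HDFS locations
src_folder = "hdfs:///user/temp/backup_data"
dest_folder = "hdfs:///user/another_dir/to_save_dir"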
Upvotes: 4