Reputation: 501
I'm using Spark 2.3.1 and running NLTK over thousands of input files.
From each input file I extract unigram, bigram, and trigram words and save them in separate dataframes.
Now I want to save each dataframe to its respective file in HDFS, appending the output to the same file every time, so that at the end I have three CSV files named unigram.csv, bigram.csv, and trigram.csv holding the results of all the input files.
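Roughly, the write I'm attempting looks like this (a sketch; the dataframe names and output paths are placeholders):

```
// unigramDF, bigramDF, trigramDF hold the extracted n-grams for one input file.
// I would like each of these to keep appending to a single CSV file.
unigramDF.write.mode("append").csv("hdfs:///output/unigram.csv")
bigramDF.write.mode("append").csv("hdfs:///output/bigram.csv")
trigramDF.write.mode("append").csv("hdfs:///output/trigram.csv")
```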
If this scenario isn't possible with HDFS, can you suggest how to do it using the local disk as the storage path?
Upvotes: 0
Views: 90
Reputation: 4045
File append in a normal programming language is not the same thing as the DataFrame `append` write mode. Whenever we ask a DataFrame to save to a folder location, it creates a new part file for every append. The only way you can achieve a single file is:
```
import org.apache.spark.sql.SaveMode

// dfOld: the dataframe already written out; dfNewToAppend: the new results
dfOld.union(dfNewToAppend)
  .coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .csv("/tempWrite")
```

Then delete your old output and rename the `/tempWrite` folder to your output folder name.

Once you have written all your dataframes, use the Hadoop `FileSystem` rename operation if you want each output to have a fixed file name:

```
val spark = SparkSession.builder.master("local[*]").getOrCreate()
import org.apache.hadoop.fs._
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
/// Write your unigram Dataframe
fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/unigram.csv"))

/// Write your bigram Dataframe
fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/bigram.csv"))

/// Write your trigram Dataframe
fs.rename(new Path(".../achyuttest.csv/part-00000"), new Path("yourNewHDFSDir/trigram.csv"))
```
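Putting both steps together, one complete append cycle might look like the sketch below. The paths and the `dfNewToAppend` input are placeholders, and it assumes a previous run has already created the target file; `globStatus` is just one way to locate the single part file that `coalesce(1)` produces:

```
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.hadoop.fs._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// Placeholder locations; substitute your own.
val tempDir = new Path("/tempWrite")
val target  = new Path("yourNewHDFSDir/unigram.csv")

// Placeholder for the new batch of unigram rows.
val dfNewToAppend = spark.read.csv("hdfs:///incoming/new-unigrams.csv")

// 1. Union the previously accumulated output with the new batch
//    (assumes target already exists and has the same schema).
val dfOld  = spark.read.csv(target.toString)
val merged = dfOld.union(dfNewToAppend).coalesce(1)

// 2. Write a single part file into the temp directory.
merged.write.mode(SaveMode.Overwrite).csv(tempDir.toString)

// 3. Move that part file over the target name and clean up.
val partFile = fs.globStatus(new Path(tempDir, "part-*"))(0).getPath
fs.delete(target, true)
fs.rename(partFile, target)
fs.delete(tempDir, true)
```

The delete-then-rename leaves exactly one `unigram.csv` in place after each cycle; repeat the same pattern for bigram.csv and trigram.csv.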
Upvotes: 1