Achyut Vyas

Reputation: 501

Pyspark NLTK save output

I'm using Spark 2.3.1 and running NLTK over thousands of input files.

From each input file I extract unigram, bigram, and trigram words and save them in separate dataframes.

Now I want to save the dataframes into their respective files in HDFS, appending the output to the same file each time, so that at the end I have three CSV files named unigram.csv, bigram.csv, and trigram.csv containing the results from the thousands of input files.

If this isn't possible with HDFS, can you suggest how to do it using local disk as the storage path?

Upvotes: 0

Views: 90

Answers (1)

QuickSilver

Reputation: 4045

Appending to a file in an ordinary programming language is not the same as a Dataframe's `append` write mode. Whenever we ask a Dataframe to save to a folder, it creates a new part file for every append. The only way to achieve a single growing file is to:

  • Read the old file into dfOld : Dataframe
  • Union the old and new Dataframes: dfOld.union(dfNewToAppend)
  • Combine into a single output file with .coalesce(1)
  • Write to a new temporary location, e.g. /tempWrite
  • Delete the old HDFS location
  • Rename the /tempWrite folder to your output folder name
    import org.apache.spark.sql.SparkSession
    import org.apache.hadoop.fs._

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    // Write your unigram Dataframe, then rename its part file
    fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/unigram.csv"))
    // Write your bigram Dataframe, then rename its part file
    fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/bigram.csv"))
    // Write your trigram Dataframe, then rename its part file
    fs.rename(new Path(".../achyuttest.csv/part-00000"), new Path("yourNewHDFSDir/trigram.csv"))
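The bullet steps above can be sketched as one helper that handles a single output path. This is a minimal sketch, not the answer's exact code: the function name `appendToCsv` and the paths are illustrative, and it assumes header-row CSVs with identical schemas in old and new data.

    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical helper: "append" dfNew to the CSV data under `target`
    // by rewriting the whole location.
    def appendToCsv(spark: SparkSession, dfNew: DataFrame, target: String): Unit = {
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      val targetPath = new Path(target)
      val tempPath   = new Path(target + "_tempWrite")

      // 1. Read the old data if it exists and union it with the new rows.
      val combined =
        if (fs.exists(targetPath))
          spark.read.option("header", "true").csv(target).union(dfNew)
        else dfNew

      // 2. Collapse to a single part file and write to a temporary location.
      combined.coalesce(1)
        .write.mode(SaveMode.Overwrite)
        .option("header", "true")
        .csv(tempPath.toString)

      // 3. Replace the old location with the temp write.
      fs.delete(targetPath, true)
      fs.rename(tempPath, targetPath)
    }

You would call this once per dataframe per batch, e.g. `appendToCsv(spark, unigramDf, "yourNewHDFSDir/unigram")`. Note the result is still a folder containing one part file; to end up with a literal file named unigram.csv you would rename that part file as shown in the snippet above.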

Upvotes: 1

Related Questions