Achyut Vyas

Reputation: 501

Pyspark NLTK save output

I'm using Spark 2.3.1 and running NLTK over thousands of input files.

From each input file I extract unigram, bigram, and trigram words and save them in separate dataframes.

Now I want to save the dataframes into their respective files in HDFS, appending the output to the same file each time, so that at the end I have three CSV files named unigram.csv, bigram.csv, and trigram.csv containing the results from the thousands of input files.

If this isn't possible with HDFS, can you suggest how to do it using local disk as the storage path?

Upvotes: 0

Views: 90

Answers (1)

QuickSilver

Reputation: 4045

Appending to a file in an ordinary programming language is not the same as a Dataframe's `append` write mode. Whenever we ask a Dataframe to save to a folder, it creates a new part file for every append. The only way to achieve a single growing file is to:

  • Read the old file into dfOld : Dataframe
  • Union the old and new Dataframes: dfOld.union(dfNewToAppend)
  • Combine into a single output file with .coalesce(1)
  • Write to a new temporary location, e.g. /tempWrite
  • Delete the old HDFS location
  • Rename the /tempWrite folder to your output folder name
    import org.apache.spark.sql.SparkSession
    import org.apache.hadoop.fs._

    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    // Write your unigram Dataframe, then rename its part file
    fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/unigram.csv"))
    // Write your bigram Dataframe, then rename its part file
    fs.rename(new Path(".../achyuttest.csv/part-00000..."), new Path("yourNewHDFSDir/bigram.csv"))
    // Write your trigram Dataframe, then rename its part file
    fs.rename(new Path(".../achyuttest.csv/part-00000"), new Path("yourNewHDFSDir/trigram.csv"))
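The bullet steps above can be sketched as one helper that handles a single output path. This is a minimal sketch, not the answer's exact code: the function name `appendToCsv` and the paths are illustrative, and it assumes header-row CSVs with identical schemas in old and new data.

    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical helper: "append" dfNew to the CSV data under `target`
    // by rewriting the whole location.
    def appendToCsv(spark: SparkSession, dfNew: DataFrame, target: String): Unit = {
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      val targetPath = new Path(target)
      val tempPath   = new Path(target + "_tempWrite")

      // 1. Read the old data if it exists and union it with the new rows.
      val combined =
        if (fs.exists(targetPath))
          spark.read.option("header", "true").csv(target).union(dfNew)
        else dfNew

      // 2. Collapse to a single part file and write to a temporary location.
      combined.coalesce(1)
        .write.mode(SaveMode.Overwrite)
        .option("header", "true")
        .csv(tempPath.toString)

      // 3. Replace the old location with the temp write.
      fs.delete(targetPath, true)
      fs.rename(tempPath, targetPath)
    }

You would call this once per dataframe per batch, e.g. `appendToCsv(spark, unigramDf, "yourNewHDFSDir/unigram")`. Note the result is still a folder containing one part file; to end up with a literal file named unigram.csv you would rename that part file as shown in the snippet above.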

Upvotes: 1

Related Questions