DK2

Reputation: 661

Edit csv file in Scala

I would like to edit a CSV file (more than 500 MB). If I have data like

ID, NUMBER
A, 1
B, 3
C, 4
D, 5

I want to add an extra column, like

ID, NUMBER, DIFF
A, 1, 0
B, 3, 2
C, 4, 1
D, 5, 1

I would also like this data to be available as a Scala data type.

(in) original CSV file -> (out) (new CSV file, file data (RDD type?))
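Independent of any CSV library, the DIFF computation itself can be sketched in plain Scala (the `Row` case class and `withDiff` helper below are hypothetical names, not part of any library):

```scala
case class Row(id: String, number: Int)

// First row gets diff 0; every later row gets number - previous number.
def withDiff(rows: Seq[Row]): Seq[(String, Int, Int)] =
  rows.headOption match {
    case None => Seq.empty
    case Some(first) =>
      (first.id, first.number, 0) +:
        rows.sliding(2).collect { case Seq(a, b) =>
          (b.id, b.number, b.number - a.number)
        }.toSeq
  }

val rows = Seq(Row("A", 1), Row("B", 3), Row("C", 4), Row("D", 5))
withDiff(rows)  // Seq((A,1,0), (B,3,2), (C,4,1), (D,5,1))
```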

Q1. Which is the best way to handle the data?

  1. Make a new CSV file from the original CSV file, then re-open the new file as Scala data.
  2. Make the new Scala data first, then write it out as a CSV file.

Q2. Do I need to use a DataFrame for this? Which library or API should I use?

Upvotes: 2

Views: 1595

Answers (2)

Nicolas Rinaudo

Reputation: 6178

A fairly trivial way to achieve that is to use kantan.csv:

import kantan.csv.ops._
import kantan.csv.generic.codecs._
import java.io.File

case class Input(id: String, number: Int)
case class Output(id: String, number: Int, diff: Int)

// Stream through the rows, remembering the previous NUMBER so the DIFF
// column can be computed as the difference with the previous row.
var previous: Option[Int] = None
val data = new File("input.csv").asUnsafeCsvReader[Input](',', true)
  .map { i =>
    val out = Output(i.id, i.number, previous.fold(0)(i.number - _))
    previous = Some(i.number)
    out
  }

new File("output.csv").writeCsv[Output](data.toIterator, ',', List("ID", "NUMBER", "DIFF"))

This code will work regardless of the data size, since at no point do we load the entire dataset (or, indeed, more than one row) in memory.

Note that in my example code, data comes from and goes to File instances, but it could come from anything that can be turned into a Reader instance - a URI, a String...

Upvotes: 3

Tzach Zohar

Reputation: 37852

RDD vs DataFrame: both are good options. The recommendation is to use DataFrames, which allow some extra optimizations behind the scenes, but for simple enough tasks the performance is probably similar.

Another advantage of DataFrames is the ability to use SQL: if you're comfortable with SQL, you can just load the file, register it as a temp table, and query it to perform any transformation. A more relevant advantage here is the ability to use Databricks' spark-csv library to easily read and write CSV files.

Let's assume you will use DataFrames (DF) for now:

Flow: it sounds like you should

  1. Load original file to a DF, call it input
  2. Transform it to the new DF, called withDiff
  3. At this point, it would make sense to cache the result, let's call the cached DF result
  4. Now you can save result to the new CSV file
  5. Use result again for whatever else you need
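The steps above can be sketched roughly as follows. This assumes Spark 1.x with the spark-csv package on the classpath; the file names, the `ID`/`NUMBER` column names, and ordering the window by `ID` are assumptions taken from the question (and on older 1.x releases, window functions may require a `HiveContext`):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag, lit}

val sqlContext: SQLContext = ??? // provided by your Spark setup

// 1. Load the original file into a DF.
val input = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("input.csv")

// 2. Transform: DIFF is NUMBER minus the previous row's NUMBER,
//    defaulting to 0 for the first row.
val w = Window.orderBy("ID")
val withDiff = input.withColumn(
  "DIFF", coalesce(col("NUMBER") - lag("NUMBER", 1).over(w), lit(0)))

// 3. Cache so the write and any later uses don't recompute the plan.
val result = withDiff.cache()

// 4. Save to a new CSV file.
result.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("output.csv")

// 5. result is still available here for further queries.
```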

Upvotes: 2

Related Questions