DK2

Reputation: 661

Edit csv file in Scala

I would like to edit a CSV file (more than 500 MB). If I have data like

ID, NUMBER
A, 1
B, 3
C, 4
D, 5

I want to add an extra column, like

ID, NUMBER, DIFF
A, 1, 0
B, 3, 2
C, 4, 1
D, 5, 1

I would also like this data to be available as a Scala data type.

(in) original CSV file -> (out) (new CSV file, file data (RDD type?))
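Independent of any CSV library, the DIFF computation itself can be sketched in plain Scala (the `Row` case class and `withDiff` helper below are hypothetical names, not part of any library):

```scala
case class Row(id: String, number: Int)

// First row gets diff 0; every later row gets number - previous number.
def withDiff(rows: Seq[Row]): Seq[(String, Int, Int)] =
  rows.headOption match {
    case None => Seq.empty
    case Some(first) =>
      (first.id, first.number, 0) +:
        rows.sliding(2).collect { case Seq(a, b) =>
          (b.id, b.number, b.number - a.number)
        }.toSeq
  }

val rows = Seq(Row("A", 1), Row("B", 3), Row("C", 4), Row("D", 5))
withDiff(rows)  // Seq((A,1,0), (B,3,2), (C,4,1), (D,5,1))
```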

Q1. Which is the best way to handle the data?

  1. Make a new CSV file from the original CSV file, then re-open the new file as Scala data.
  2. Make the new Scala data first, then write it out as a CSV file.

Q2. Do I need to use a DataFrame for this? Which library or API should I use?

Upvotes: 2

Views: 1595

Answers (2)

Nicolas Rinaudo

Reputation: 6178

A fairly trivial way to achieve that is to use kantan.csv:

import kantan.csv.ops._
import kantan.csv.generic.codecs._
import java.io.File

case class Input(id: String, number: Int)
case class Output(id: String, number: Int, diff: Int)

// Stream through the rows, remembering the previous NUMBER so the DIFF
// column can be computed as the difference with the previous row.
var previous: Option[Int] = None
val data = new File("input.csv").asUnsafeCsvReader[Input](',', true)
  .map { i =>
    val out = Output(i.id, i.number, previous.fold(0)(i.number - _))
    previous = Some(i.number)
    out
  }

new File("output.csv").writeCsv[Output](data.toIterator, ',', List("ID", "NUMBER", "DIFF"))

This code will work regardless of the data size, since at no point do we load the entire dataset (or, indeed, more than one row) in memory.

Note that in my example code, data comes from and goes to File instances, but it could come from anything that can be turned into a Reader instance - a URI, a String...

Upvotes: 3

Tzach Zohar

Reputation: 37852

RDD vs DataFrame: both are good options. The recommendation is to use DataFrames, which allow some extra optimizations behind the scenes, but for simple enough tasks the performance is probably similar.

Another advantage of DataFrames is the ability to use SQL: if you're comfortable with SQL, you can just load the file, register it as a temp table, and query it to perform any transformation. A more relevant advantage here is the ability to use Databricks' spark-csv library to easily read and write CSV files.

Let's assume you will use DataFrames (DF) for now:

Flow: it sounds like you should

  1. Load original file to a DF, call it input
  2. Transform it to the new DF, called withDiff
  3. At this point, it would make sense to cache the result, let's call the cached DF result
  4. Now you can save result to the new CSV file
  5. Use result again for whatever else you need
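The steps above can be sketched roughly as follows. This assumes Spark 1.x with the spark-csv package on the classpath; the file names, the `ID`/`NUMBER` column names, and ordering the window by `ID` are assumptions taken from the question (and on older 1.x releases, window functions may require a `HiveContext`):

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag, lit}

val sqlContext: SQLContext = ??? // provided by your Spark setup

// 1. Load the original file into a DF.
val input = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("input.csv")

// 2. Transform: DIFF is NUMBER minus the previous row's NUMBER,
//    defaulting to 0 for the first row.
val w = Window.orderBy("ID")
val withDiff = input.withColumn(
  "DIFF", coalesce(col("NUMBER") - lag("NUMBER", 1).over(w), lit(0)))

// 3. Cache so the write and any later uses don't recompute the plan.
val result = withDiff.cache()

// 4. Save to a new CSV file.
result.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("output.csv")

// 5. result is still available here for further queries.
```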

Upvotes: 2

Related Questions