Reputation: 661
I would like to edit a CSV file (more than 500MB). If I have data like
ID, NUMBER
A, 1
B, 3
C, 4
D, 5
I want to add an extra column, like
ID, NUMBER, DIFF
A, 1, 0
B, 3, 2
C, 4, 1
D, 5, 1
This data should also be available as a Scala data type.
(in) original CSV file -> (out) (new CSV file, file data (RDD type?))
Q1. What is the best way to process this data?
Q2. Do I need to use a DataFrame for this? Which library or API should I use?
Upvotes: 2
Views: 1595
Reputation: 6178
A fairly trivial way to achieve that is to use kantan.csv:
import kantan.csv.ops._
import kantan.csv.generic.codecs._
import java.io.File

case class Input(id: String, number: Int)
case class Output(id: String, number: Int, diff: Int)

// Track the previous row's number so each row's DIFF can be computed.
var prev: Option[Int] = None
val data = new File("input.csv").asUnsafeCsvReader[Input](',', true).map { i =>
  val diff = prev.fold(0)(i.number - _)
  prev = Some(i.number)
  Output(i.id, i.number, diff)
}
new File("output.csv").writeCsv[Output](data.toIterator, ',', List("ID", "NUMBER", "DIFF"))
This code will work regardless of the data size, since at no point do we load the entire dataset (or, indeed, more than one row) in memory.
Note that in my example code, data comes from and goes to File instances, but it could come from anything that can be turned into a Reader instance - a URI, a String...
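For example, a quick sketch of reading straight from a String, assuming the same kantan.csv version (tuples are decoded out of the box, so the generic codecs are not needed here):

import kantan.csv.ops._

// Any type with a CSV source instance works, String included.
val rows = "A,1\nB,3\nC,4\nD,5".asUnsafeCsvReader[(String, Int)](',', false).toList
// rows == List(("A", 1), ("B", 3), ("C", 4), ("D", 5))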
Upvotes: 3
Reputation: 37852
RDD vs DataFrame: both are good options. The recommendation is to use DataFrames, which allow some extra optimizations behind the scenes, but for simple enough tasks the performance is probably similar. Another advantage of using DataFrames is the ability to use SQL - if you're comfortable with SQL, you can just load the file, register it as a temp table, and query it to perform any transformation. A more relevant advantage of DataFrames is the ability to use Databricks' spark-csv library to easily read and write CSV files.
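For instance, a rough sketch of the SQL route, assuming Spark 2.x (where CSV reading and temp views are built in; on Spark 1.x you would load via the spark-csv data source instead) and a header row of exactly ID,NUMBER:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-sql").getOrCreate()

// Load the CSV and expose it to SQL as a temp view.
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("input.csv")
df.createOrReplaceTempView("numbers")

// DIFF = gap to the previous row's NUMBER (0 for the first row).
// Ordering by ID assumes IDs sort in file order, as in the sample data.
val withDiff = spark.sql(
  """SELECT ID, NUMBER,
    |       coalesce(NUMBER - lag(NUMBER) OVER (ORDER BY ID), 0) AS DIFF
    |FROM numbers""".stripMargin)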
Let's assume you will use DataFrames (DF) for now:
Flow: sounds like you should (see the sketch after this list):
1. Load the original file into a DF, call it input
2. Compute the new DIFF column to get a new DF, call it withDiff
3. Cache the result; call the cached DF result
4. Save result to the new CSV file
5. Use result again for whatever else you need
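A minimal sketch of that flow, assuming Spark 2.x with built-in CSV support (note the window-based DIFF pulls all rows into a single partition, a real limitation at 500MB, used here only for brevity):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag, lit}

val spark = SparkSession.builder().appName("csv-diff").getOrCreate()
import spark.implicits._

// 1. Load the original file into a DF
val input = spark.read.option("header", "true").option("inferSchema", "true").csv("input.csv")

// 2. Compute DIFF against the previous row's NUMBER
//    (ordering by ID, which assumes IDs sort in file order)
val byId = Window.orderBy($"ID")
val withDiff = input.withColumn("DIFF",
  coalesce($"NUMBER" - lag($"NUMBER", 1).over(byId), lit(0)))

// 3. Cache, since the result is used more than once
val result = withDiff.cache()

// 4. Save result to the new CSV file (Spark writes a directory of part files)
result.write.option("header", "true").csv("output")

// 5. Use result again for whatever else you need
result.show()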
Upvotes: 2