naval jain

Reputation: 393

Process two very large CSV files at the same time: replace columns in one file with values from the other

I want to replace some of the columns in one CSV file with column values from another CSV file; the two files cannot fit in memory together. Language constraints: Java or Scala. No framework constraints.

One of the files holds a key-value mapping, and the other has a large number of columns. We have to replace the values in the large CSV file with the values from the file that holds the key-value mapping.

Upvotes: 0

Views: 80

Answers (1)

gtosto

Reputation: 1341

Under the assumption that you can hold all the key-value mappings in memory, load them into a map and then process the big file in a streaming fashion:

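import java.io.{File, PrintWriter}
import scala.io.Source

// Load the key-value file fully into memory as a simple map
val kv_file = Source.fromFile("key_values.csv")
val kv: Map[String, String] = try {
  kv_file.getLines().map { line =>
    val cols = line.split(";")
    cols(0) -> cols(1)
  }.toMap
} finally kv_file.close()

// "big_file.csv" is a placeholder name: big_file was never defined in the original snippet
val big_file = Source.fromFile("big_file.csv")
val writer = new PrintWriter(new File("processed_big_file.csv"))

// getLines() is lazy, so the big file is streamed one line at a time
big_file.getLines().foreach { line =>
  // this is the key-value replace logic (as I understood)
  // the limit of -1 keeps trailing empty columns
  val processed_cols = line.split(";", -1).map { s => kv.getOrElse(s, s) }

  // println appends the line terminator that write() would omit
  writer.println(processed_cols.mkString(";"))
}
// close files
writer.close()
big_file.close()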

If you cannot fully load the key-value mapping, you could load part of the key-value file into memory and still stream the big file against it. Of course, you then have to iterate over the files several times until all the keys have been processed.
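A minimal sketch of that multi-pass idea could look like the following. It is only one way to do it: the names ChunkedReplace, replacePass, run and chunkSize are made up for illustration, the ";" separator is carried over from the snippet above, and it assumes well-formed two-column key-value lines with no quoted fields. Each pass loads at most chunkSize mappings and rewrites the whole big file once. Note that a value written in one pass can be replaced again in a later pass if it happens to equal a later key, so this only behaves like the single-map version when replacement values never collide with keys.

```scala
import java.io.PrintWriter
import java.nio.file.{Files, Path, StandardCopyOption}
import scala.io.Source

object ChunkedReplace {
  // Streams `in` to `out`, replacing every cell that appears as a key in `kv`.
  def replacePass(in: Path, out: Path, kv: Map[String, String]): Unit = {
    val src = Source.fromFile(in.toFile)
    val writer = new PrintWriter(out.toFile)
    try {
      src.getLines().foreach { line =>
        // the limit of -1 keeps trailing empty columns
        val cols = line.split(";", -1).map(s => kv.getOrElse(s, s))
        writer.println(cols.mkString(";"))
      }
    } finally {
      writer.close()
      src.close()
    }
  }

  // Applies the key-value file to the big file, chunkSize mappings at a time.
  def run(kvFile: Path, bigFile: Path, result: Path, chunkSize: Int): Unit = {
    val tmp = Files.createTempFile("replace_pass", ".csv")
    // Start from a copy so the original big file is left untouched.
    Files.copy(bigFile, result, StandardCopyOption.REPLACE_EXISTING)
    val kvSrc = Source.fromFile(kvFile.toFile)
    try {
      // grouped() pulls only chunkSize lines of the key-value file into memory per pass
      kvSrc.getLines().grouped(chunkSize).foreach { chunk =>
        val kv = chunk.map { line =>
          val cols = line.split(";", -1)
          cols(0) -> cols(1)
        }.toMap
        replacePass(result, tmp, kv)
        // The output of this pass becomes the input of the next one.
        Files.move(tmp, result, StandardCopyOption.REPLACE_EXISTING)
      }
    } finally {
      kvSrc.close()
      Files.deleteIfExists(tmp)
    }
  }
}
```

The number of passes over the big file is the number of key-value lines divided by chunkSize, rounded up, so chunkSize should be as large as memory allows.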

Upvotes: 2
