Phoen
Phoen

Reputation: 277

Reading CSV file with multi line strings in Scala

I have a csv file and I want to read it line by line. The problem is that some cell values are in quotes containing line breaks.

Here is an example CSV:

Product,Description,Price
Product A,This is Product A,20
Product B,"This is much better
than Product A",200

The standard getLines() function does not handle that.

Source.fromFile(inputFile).getLines()  // will split at every line break, regardless if quoted or not

The getLines gets something like:

Array("Product", "Description", "Price")
Array("Product A", "this is Product A", "20")
Array("Product A", "\"This is much better")
Array("than Product A\"", "20")

But it should be like:

Array("Product", "Description", "Price")
Array("Product A", "this is Product A", "20")
Array("Product A", "\"This is much better\nthan Product A\"", "20")

I tried it to read the file fully and split is with a RegEx similar to this post https://stackoverflow.com/a/31193505

file.mkString.split("""\n(?=(?:[^"]*"[^"]*")*[^"]*$)""")

The regex works fine, but I'm getting a stack overflow exception because the file is too big to handle it fully out of memory. I tried it with a smaller version of the file and it worked.

As stated in the post, a foldLeft() could help for bigger files. But I'm not sure how it should work, when iterating over every Char of the string, to pass all at once...

  1. the Char of your current iteration
  2. the Line what you are building
  3. and the List of already created lines

Maybe it works to write an own tail recursive version of the getLines, but I'm not sure if there isn't a more practical solution instead of handling it char by char.

Do you see any other functional-style solution to this problem?

Tanks and regards, Felix

Upvotes: 2

Views: 1217

Answers (3)

jwvh
jwvh

Reputation: 51271

I wonder if the new (Scala 2.13) unfold() can be put to good use here.

                        // "file" has been opened
val lines = Iterator.unfold(file.getLines()){ itr =>
              Option.when(itr.hasNext) {
                val sb = new StringBuilder(itr.next)
                while (itr.hasNext && sb.count(_ == '"') % 2 > 0)
                  sb.append("\\n" + itr.next)
                (sb.toString, itr)
              }
            }

Now you can iterate over the content as needed.

lines.foreach(println)
//Product,Description,Price
//Product A,This is Product A,20
//Product B,"This is much better\nthan Product A",200
//Product C,a "third rate" product,5

Note that this is pretty simple in that it just counts all the quote marks, looking for an even number. It doesn't recognize escaped quotes, \", as different, but it shouldn't be too difficult to use a regex so that it only counts non-escaped quotes.

Since we're using iterators it should be memory efficient and handle files of any size, as long as no errant single quote triggers the rest of the file to be read in as one line of text.

Upvotes: 3

C4stor
C4stor

Reputation: 8036

The most simple answer is finding an external lib to do it !

If it's not a solution for you, the foldLeft solution is imo the best functional style ! Here's a simple version :

  val lines = Source.fromFile(inputFile).getLines()

  lines.foldLeft[(Seq[String], String)](Nil, "") {
    case ((accumulatedLines, accumulatedString), newLine) => {
      val isInAnOpenString = accumulatedString.nonEmpty
      val lineHasOddQuotes =  newLine.count(_ == '"') % 2 == 1
      (isInAnOpenString, lineHasOddQuotes) match {
        case (true, true) => (accumulatedLines :+ (accumulatedString + newLine)) -> ""
        case (true, false) => accumulatedLines -> (accumulatedString + newLine)
        case (false, true) => accumulatedLines -> newLine
        case (false, false) => (accumulatedLines :+ newLine) -> ""
      }
    }
  }._1

Note that this version won't handle too much special cases, like having multiple values on a line that contain multiple lines, but it should give you a good start idea.

The main idea is to foldLeft on pretty much everything you need to keep in memory, and from that change your state progressively.

As you can see, inside the foldLeft you can have as much logic as needed. In this case, I added extra booleans and a nest match case for readibility for example.

So my advice would be : foldLeft, and don't panic !

Upvotes: 3

Pritish
Pritish

Reputation: 591

You can use a third party library to do this, like opencsv

maven repo -> https://mvnrepository.com/artifact/au.com.bytecode/opencsv/2.4

code samples -> https://www.programcreek.com/java-api-examples/au.com.bytecode.opencsv.CSVReader

Upvotes: 2

Related Questions