Chocobo
Chocobo

Reputation: 69

Complex CSV to arrays

I have a CSV file created by MSExcel which contains random carriage returns within elements. Each row is separated by a carriage return in the final CSV (I don't know if it's MSExcel specific).

Let's say I have 10 columns and 100 rows, some elements on the last column has carriage returns in themselves (hence quoted by double quotes in the final output), and some doesn't.

Is there anyway I could split those 100 rows correctly?

Even if I'm ready to sacrifice the ones that don't have carriage returns and manually adding one at the end, making all 10th, 20th, 30th elements end with " + \n, JS can't seem to recognize "\"\n" as a valid syntax.

Edit: Would this kind of operation better be done with php + database? If so, where should I start?

Upvotes: 3

Views: 764

Answers (3)

Joel Coehoorn
Joel Coehoorn

Reputation: 415600

If you're thinking about a .split() function (a la simple split on each comma) to parse csv data, you're thinking about it the wrong way. Typical generic Split() functions are wrong for csv because there are all kinds of edge cases that can trip them up, and because the performance is not good.

The right way to parse csv data is with a purpose-built state machine (and not a state machine defined by a regex). The good news is that you don't have to write that state machine yourself. The bad news is that you have to be careful, because Google is littered with bad javascript csv parsers... the current first result of a quick search I did, for example, is a bad example hosted here on Stack Overflow itself. The example relies too heavily on regex, which has a hard time with nested quoted text. It's possible to build a regular expression that will work, but it's also easy to get wrong, is hard to maintain, and the performance of the expression won't typcially be as good (because the expression will need to do back-tracking). Parsing CSV data soley with regular expressions is almost as bad as parsing html with regular expressions.

Here's the first good (state machine -based) example I saw in Google:

http://yawgb.blogspot.com/2009/03/parsing-comma-separated-values-in.html

This particular parser assumes commas as the delimter (as opposed to something like tabs, semi-colons, or pipes) and assumes double escaped quotes (quotes inside a quoted text field are escaped by themselves, like this: ""). If that matches your data, then this example will probably be good — but again, this was a quick search; I didn't look too closely myself. Otherwise, keep trolling google or use this example to write your own.

Moving on from there, I'm curious because it sounds a little bit like you might be using Excel or flat csv files as the primary data store for your web site. This is also just a really bad idea. Both Excel and flat files have huge concurrency problems as you start to have several people using the page at about the same time. Performance will likely be a problem, too, though I hesitate to stretch that point too far; it's better to say that performance will be what you make of it, but it's easy to get very wrong for flat files.

Upvotes: 4

Evan Plaice
Evan Plaice

Reputation: 14140

It's harder than you think...

Well, my first post was conveniently deleted by a mod because I posted an external link to my OSS CSV parser project.

So... I'll just post the complete ND-FSM (Non-Deterministic Finite State Machine) line splitter that's necessary to handle values that contain strings.

Here we go:

splitLines: function(csv, delimiter) {
  var state = 0;
  var value = "";
  var line = "";
  var lines = [];
  function endOfRow() {
    lines.push(value);
    value = "";
    state = 0;
  };
  csv.replace(/(\"|,|\n|\r|[^\",\r\n]+)/gm, function (m0){
    switch (state) {
      // the start of an entry/value
      case 0:
        if (m0 === "\"") {
          state = 1;
        } else if (m0 === "\n") {
          endOfRow();
        } else if (/^\r$/.test(m0)) {
          // carriage returns are ignored
        } else {
          value += m0;
          state = 3;
        }
        break;
      // delimited input  
      case 1:
        if (m0 === "\"") {
          state = 2;
        } else {
          value += m0;
          state = 1;
        }
        break;
      // delimiter found in delimited input
      case 2:
        // is the delimiter escaped?
        if (m0 === "\"" && value.substr(value.length - 1) === "\"") {
          value += m0;
          state = 1;
        } else if (m0 === ",") {
          value += m0;
          state = 0;
        } else if (m0 === "\n") {
          endOfRow();
        } else if (m0 === "\r") {
          // Ignore
        } else {
          throw new Error("Illegal state");
        }
        break;
      // un-delimited input
      case 3:
        if (m0 === ",") {
          value += m0;
          state = 0;
        } else if (m0 === "\"") {
          throw new Error("Unquoted delimiter found");
        } else if (m0 === "\n") {
          endOfRow();
        } else if (m0 === "\r") {
          // Ignore
        } else {
          throw new Error("Illegal data");
        }
          break;
      default:
        throw new Error("Unknown state");
    }
    return "";
  });
  if (state != 0) {
    endOfRow();
  }
  return lines;
}

If you understand complexity theory, here's the mapping:

NDFSM Diagram

States:

  • 0: The start of a new value/entry
  • 1: Is quoted (ie is surrounded by double quotes)
  • 2: Double quote encountered
  • 3: Is not quoted

Note: This only handles the line-splitting portion. The rest is handled by a separate (and very complex) regex routine.

I can't take credit for coming up with the idea to use a lexer. That goes to another dev who I won't name publicly without permission.

If you'd like to gain the ability to handle any RFC 4180 compliant CSV data without the head-ache of implementing it yourself, take a look at the project mentioned in my profile. Otherwise, good luck...

Upvotes: 1

Kye
Kye

Reputation: 4495

I'd probably recommend doing this in PHP unless you really need to do it in JS; I'm pretty sure that there is a PHP library for processing CSV files anyway.

Upvotes: 0

Related Questions