srgb
srgb

Reputation: 5193

How to properly escape a double quote in CSV?

I have a line like this in my CSV:

"Samsung U600 24"","10000003409","1","10000003427"

Quote next to 24 is used to express inches, while the quote just next to that quote closes the field. I'm reading the line with fgetcsv but the parser makes a mistake and reads the value as:

Samsung U600 24",10000003409"

I tried putting a backslash before the inches quote, but then I just get a backslash in the name:

Samsung U600 24\"

Is there a way to properly escape this in the CSV, so that the value would be Samsung U600 24" , or do I have to regex it in the processor?

Upvotes: 330

Views: 339693

Answers (7)

Adam Cain
Adam Cain

Reputation: 1

I know this is an old post but here's the extremely short answer... The line in proper CSV format should be "Samsung U600 24""","10000003409","1","10000003427" Following the 24 you should have 3 double quotes. The first 2 double quotes will make the one double quote in the actual field. The third double quote ends the field started by the doublequote in front of Samsung.

Upvotes: 0

qwr
qwr

Reputation: 10919

CSV is, in theory, a simple format: tabular data delimited by commas. Regrettably there is no formal specification, so there are many subtly different implementations out there. This requires some care when importing/exporting! I will quote RFC 4180 for common implementations:

2.  Definition of the CSV Format

   While there are various specifications and implementations for the
   CSV format (for ex. [4], [5], [6] and [7]), there is no formal
   specification in existence, which allows for a wide variety of
   interpretations of CSV files.  This section documents the format that
   seems to be followed by most implementations:

   1.  Each record is located on a separate line, delimited by a line
       break (CRLF).  For example:

       aaa,bbb,ccc CRLF
       zzz,yyy,xxx CRLF

   2.  The last record in the file may or may not have an ending line
       break.  For example:

       aaa,bbb,ccc CRLF
       zzz,yyy,xxx

   3.  There maybe an optional header line appearing as the first line
       of the file with the same format as normal record lines.  This
       header will contain names corresponding to the fields in the file
       and should contain the same number of fields as the records in
       the rest of the file (the presence or absence of the header line
       should be indicated via the optional "header" parameter of this
       MIME type).  For example:

       field_name,field_name,field_name CRLF
       aaa,bbb,ccc CRLF
       zzz,yyy,xxx CRLF


   4.  Within the header and each record, there may be one or more
       fields, separated by commas.  Each line should contain the same
       number of fields throughout the file.  Spaces are considered part
       of a field and should not be ignored.  The last field in the
       record must not be followed by a comma.  For example:

       aaa,bbb,ccc

   5.  Each field may or may not be enclosed in double quotes (however
       some programs, such as Microsoft Excel, do not use double quotes
       at all).  If fields are not enclosed with double quotes, then
       double quotes may not appear inside the fields.  For example:

       "aaa","bbb","ccc" CRLF
       zzz,yyy,xxx

   6.  Fields containing line breaks (CRLF), double quotes, and commas
       should be enclosed in double-quotes.  For example:

       "aaa","b CRLF
       bb","ccc" CRLF
       zzz,yyy,xxx

   7.  If double-quotes are used to enclose fields, then a double-quote
       appearing inside a field must be escaped by preceding it with
       another double quote.  For example:

       "aaa","b""bb","ccc"

So usually

  • A field may or may not be enclosed in double quotes. (The RFC from 2005 says Excel doesn't use double quotes, but I tested with Excel 2016 and it does.)
  • Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes. (In particular, a CSV file may have multiple lines as they appear in a text editor correspond to one row of data.)
  • If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote
    • So "" in the raw CSV field represents an empty string, and """" in a raw CSV represents a single quote, ".

Usually CSV readers can detect and handle either CRLF (Windows-style) or LF (Unix-style) line endings, and the newline after the last row doesn't matter.

However you may encounter data that escapes quotes or other characters (delimiter, linebreaks, escape character itself) with an escape character like \. For example, in readr's read_csv(), this is controlled by escape_double and escape_backslash. Some unusual data use a comment character like # (default in R's read.table but not read.csv).

Always check imported data for any irregularities, including if any fields have quotes or newlines, or a row doesn't have a consistent number of columns (an error). Also be careful with programs like Excel that might strip off leading zeros from strings that look like numbers. Unlike JSON, there is no concept of data type in CSV, just text.

Aside: the MIME standard for TSV takes a simplistic approach by simply disallowing tabs and newlines in fields. LF, CR, and tab need to be escaped, usually with \n, \r, \t (backslash itself is \\). I actually quite like this approach as it completely gets rid of quoting ambiguities, and rows correspond exactly to newline splits.

Upvotes: 11

user4035
user4035

Reputation: 23729

Use 2 quotes:

"Samsung U600 24"""

RFC-4180 says "If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote."

Upvotes: 566

Parakrama Dharmapala
Parakrama Dharmapala

Reputation: 1229

Since no one has mentioned the way I usually do it, I'll just type this down. When there's a tricky string, I don't even bother escaping it.

What I do is just base64_encode and base64_decode, that is, encode the value to Base64 before writing the CSV line and when I want to read it, decode.

For your example assuming it's PHP:

$csvLine = [base64_encode('Samsung U600 24"'),"10000003409","1","10000003427"];

And when I want to take the value, I do the opposite.

$value = base64_decode($csvLine[0])

I just don't like to go through the pain.

Upvotes: -3

Scott R. Frost
Scott R. Frost

Reputation: 2061

I know this is an old post, but here's how I solved it (along with converting null values to empty string) in C# using an extension method.

Create a static class with something like the following:

    /// <summary>
    /// Wraps value in quotes if necessary and converts nulls to empty string
    /// </summary>
    /// <param name="value"></param>
    /// <returns>String ready for use in CSV output</returns>
    public static string Q(this string value)
    {
        if (value == null)
        {
            return string.Empty;
        }
        if (value.Contains(",") || (value.Contains("\"") || value.Contains("'") || value.Contains("\\"))
        {
            return "\"" + value + "\"";
        }
        return value;
    }

Then for each string you're writing to CSV, instead of:

stringBuilder.Append( WhateverVariable );

You just do:

stringBuilder.Append( WhateverVariable.Q() );

Upvotes: -4

devapriya
devapriya

Reputation: 63

I have written in Java.

public class CSVUtil {
    public static String addQuote(
            String pValue) {
        if (pValue == null) {
            return null;
        } else {
            if (pValue.contains("\"")) {
                pValue = pValue.replace("\"", "\"\"");
            }
            if (pValue.contains(",")
                    || pValue.contains("\n")
                    || pValue.contains("'")
                    || pValue.contains("\\")
                    || pValue.contains("\"")) {
                return "\"" + pValue + "\"";
            }
        }
        return pValue;
    }

    public static void main(String[] args) {
        System.out.println("ab\nc" + "|||" + CSVUtil.addQuote("ab\nc"));
        System.out.println("a,bc" + "|||" + CSVUtil.addQuote("a,bc"));
        System.out.println("a,\"bc" + "|||" + CSVUtil.addQuote("a,\"bc"));
        System.out.println("a,\"\"bc" + "|||" + CSVUtil.addQuote("a,\"\"bc"));
        System.out.println("\"a,\"\"bc\"" + "|||" + CSVUtil.addQuote("\"a,\"\"bc\""));
        System.out.println("\"a,\"\"bc" + "|||" + CSVUtil.addQuote("\"a,\"\"bc"));
        System.out.println("a,\"\"bc\"" + "|||" + CSVUtil.addQuote("a,\"\"bc\""));
    }
}

Upvotes: 3

Angelin Nadar
Angelin Nadar

Reputation: 9300

Not only double quotes, you will be in need for single quote ('), double quote ("), backslash (\) and NUL (the NULL byte).

Use fputcsv() to write, and fgetcsv() to read, which will take care of all.

Upvotes: 6

Related Questions