Alekhya varma

Reputation: 95

Bash script to add double quotes in .CSV comma delimited file

I need to add double quotes around every field of a CSV file. My sample data looks like this:

378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29 06:10:01",200890899011202395,0899,0279395,356058102052972,89148000005117597971,67756296

I have tried some code available online with awk and sed, but it produces the output below. There are two problems: **the first digit of each line is being trimmed (for example, '378478' is displayed as only '78478'), and double quotes are being added around fields that already have them!** Nothing seems to work perfectly. Please guide me!

"78478","COMPLETED","Tracfone","","",""2020/03/29 09:39:22"","","2787","","356074101197544","89148000005748235454","75176540"
"78328","COMPLETED",""Total Wireless"",""Unlimited Talk"," Text"," & Data (First 25GB High Speed"," then unlimited 2GB)"","50",""2020/03/29 06:10:01"","200890899011202395","0899","0279395","356058102052972","89148000005117597971","67756296"
"78329","COMPLETED",""Cricket Wireless"",""Unlimited Talk"," Text"," & 4G LTE Data w/ 15GB Hotspot"","60",""2020/03/29""

This is the code I am using:

awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' file # or
sed -E 's/([^,]*) , (.*)/"\1" , "\2"/' file

My full script is below. My intention was to first convert all .xlsx files to .csv, then add double quotes to each CSV and save it back to the same file. I know the $file.csv part is wrong, hence I need some help.

find "$Src_Dir" -type f -iname "*.xlsx" -print>path/temp

cat path/temp | while IFS="" read -r -d $'\0' file; 
do
    echo $file
    ssconvert "${file}" --export-type=Gnumeric_stf:stf_csv
    awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' $file > $file.csv
done

Upvotes: 2

Views: 3800

Answers (2)

aborruso

Reputation: 5678

Your starting CSV is not a good CSV: the two rows have different numbers of columns

+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 1      | 2         | 3              | 4                                                                        | 5  | 6                   | 7 | 8    | 9 | 10              | 11                   | 12       |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+
| 378478 | COMPLETED | Tracfone       | -                                                                        | -  | 2020/03/29 09:39:22 | - | 2787 | - | 356074101197544 | 89148000005748235454 | 75176540 |
| 378328 | COMPLETED | Total Wireless | Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB) | 50 | 2020/03/29          | - | -    | - | -               | -                    | -        |
+--------+-----------+----------------+--------------------------------------------------------------------------+----+---------------------+---+------+---+-----------------+----------------------+----------+

Using Miller (https://github.com/johnkerl/miller) you could run

mlr --csv --quote-all -N unsparsify input >output

to have

"378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
"378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29","","","","","",""

You can use it by downloading the executable from https://github.com/johnkerl/miller/releases/tag/v5.7.0
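To fold this into the conversion loop from the question, a sketch along these lines should work. It assumes mlr and ssconvert are on your PATH; the null-delimited find output, the ${file%.xlsx}.csv name derivation, and the .tmp intermediate file are adjustments to the question's script, not part of the original answer:

```shell
# Sketch: convert each .xlsx to .csv, then quote every field with Miller.
# Src_Dir comes from the question's script; default to the current directory.
Src_Dir="${Src_Dir:-.}"

# -print0 paired with read -d '' keeps file names with spaces intact.
find "$Src_Dir" -type f -iname "*.xlsx" -print0 |
while IFS= read -r -d '' file; do
    csv="${file%.xlsx}.csv"     # e.g. report.xlsx -> report.csv
    ssconvert "$file" "$csv"    # ssconvert picks the CSV exporter from the extension
    # Quote all fields "in place" via a temporary file.
    mlr --csv --quote-all -N unsparsify "$csv" > "$csv.tmp" && mv "$csv.tmp" "$csv"
done
```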

Upvotes: 2

paxdiablo

Reputation: 881153

If you want to handle anything other than the simplest CSV files, you should probably move away from sed and awk. There are much better tools available.

For example, if you sudo apt install csvtool (or equivalent) on your favourite distro, you can use its call-per-line functionality to process each line in the input file. See the following script for an example:

#!/bin/bash

function quotify {
  # Start empty line, process every field.

  line=""
  while [[ $# -ne 0 ]] ; do
      #    Append comma for all but first field, then quoted field.

      [[ -n "${line}" ]] && line="${line},"
      line="${line}\"$1\""

      shift
  done

  # Output the fully quoted line.

  echo "${line}"
}

# Needed to call functions. Also, ensure link: /bin/sh -> /bin/bash.
export -f quotify

# Pretty-print input and output.

echo "Input file:"
sed 's/^/   /' inputFile.csv

echo "Output file:"
csvtool call quotify inputFile.csv | sed 's/^/   /'

Note the quotify function which is called for each line in the CSV file, with the arguments set to each field within that line (sans quotes, whether the original fields had quotes or not).

It basically constructs a string of all the fields in the line, with quotes around them, then writes that to standard output, as shown below in the output from that script:

Input file:
   378478,COMPLETED,Tracfone,,,"2020/03/29 09:39:22",,2787,,356074101197544,89148000005748235454,75176540
   378328,COMPLETED,"Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)",50,"2020/03/29"
Output file:
   "378478","COMPLETED","Tracfone","","","2020/03/29 09:39:22","","2787","","356074101197544","89148000005748235454","75176540"
   "378328","COMPLETED","Total Wireless","Unlimited Talk, Text, & Data (First 25GB High Speed, then unlimited 2GB)","50","2020/03/29"
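If csvtool isn't packaged for your system but Python 3 is, the standard-library csv module gives the same quote-everything behaviour. A minimal sketch, using one of the sample lines from above for illustration (in practice you would redirect your real file into the pipeline):

```shell
# csv.QUOTE_ALL wraps every field in double quotes and doubles any
# embedded quotes, matching the csvtool output above.
printf '%s\n' 'and,"then, strangely,",he,was,not' |
python3 -c '
import csv, sys
writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL, lineterminator="\n")
for row in csv.reader(sys.stdin):
    writer.writerow(row)
'
# -> "and","then, strangely,","he","was","not"
```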

Even though using a separate tool is probably the easiest way to go, if you absolutely cannot install other packages, then you're going to have to code up something in a package you already have. The following bash script is a good place to start, as it uses no other tools to achieve its goal.

At the moment, it's tied to a very specific set of rules, as follows:

  • White space matters. Anything between the commas is considered part of the field. This especially matters when detecting a quoted field: the quote must be the first character of the field, so no abc, "d,e,f",ghi stuff, since the "d,e,f" won't be handled correctly.
  • Quoted fields are allowed to contain commas, and "" sequences within them are turned into ".
  • It's probably not a good idea to supply ill-formatted CSV files :-)

But, with that in mind, here we go. I'll offer a brief textual description of each section but hopefully the comments in the code will be enough to figure out what's going on.

First, a function for finding the position of some string within another string, useful for working out the field bounds:

function findPos {
    haystack="$1"
    needle="$2"

    # Remove everything past the needle.

    prefix="${haystack%%${needle}*}"

    # If nothing was removed, it wasn't found, so supply massive number.
    # Otherwise, it was found at the length of the string with removed stuff.

    position=999999
    [[ ${#prefix} -ne ${#haystack} ]] && position=${#prefix}
    echo ${position}
}
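As a quick sanity check (the function is repeated here so the snippet runs on its own), findPos reports the zero-based offset of the needle, or the 999999 sentinel when it's absent:

```shell
function findPos {
    haystack="$1"
    needle="$2"

    # Strip the needle and everything after it; if nothing was removed,
    # the needle wasn't there.
    prefix="${haystack%%${needle}*}"
    position=999999
    [[ ${#prefix} -ne ${#haystack} ]] && position=${#prefix}
    echo ${position}
}

findPos 'abc,def' ','        # -> 3 (the comma sits at offset 3)
findPos '"a""b",c' '""'      # -> 2 (the "" sequence starts at offset 2)
findPos 'no-comma-here' ','  # -> 999999 (sentinel: not found)
```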

Then we can use that in the function that works out the length of the next field. This basically just looks for the next comma for unquoted fields, and does special handling for quoted fields by building up the field from segments (it has to handle quotes within quotes and commas):

function getNextFieldLen {
    line="$1"

    # Empty line means all work done.

    [[ -z "${line}" ]] && echo -1 && return

    # Handle unquoted first, this is easy.

    [[ "${line:0:1}" != '"' ]] && { echo $(findPos "${line}" ","); return; }

    # Now handle quoted. Loop over all segments where a segment is defined as
    # the text up to the next <"">, assuming it's before the next <",>.

    field=""
    nextQuoteComma=$(findPos "${line}" '",')
    nextDoubleQuote=$(findPos "${line}" '""')
    while [[ ${nextDoubleQuote} -lt ${nextQuoteComma} ]]; do
        # Append segment to the field and go back for next segment.

        field="${field}${line:0:${nextDoubleQuote}}\"\""
        line="${line:${nextDoubleQuote}}"
        line="${line:2}"

        nextQuoteComma=$(findPos "${line}" '",')
        nextDoubleQuote=$(findPos "${line}" '""')
    done

    # Add final segment (up to the comma) and output entire field.

    field="${field}${line:0:${nextQuoteComma}}\""
    echo "${#field}"
}

Finally, there's the top-level function which will quotify whatever comes in via standard input:

function quotifyStdIn {
    # Process file line by line.

    while read -r line; do
        # Start with empty output line and non-comma separator.

        outLine="" ; sep=""

        # Place terminator to make processing easier, start field loop.

        line="${line},"
        fieldLen=$(getNextFieldLen "${line}")
        while [[ ${fieldLen} -ge 0 ]]; do
            # Get field and quotify if needed, adjust line (remove field and comma).

            field="${line:0:${fieldLen}}"
            [[ "${field:0:1}" = '"' ]] || field="\"${field}\""

            line="${line:$((fieldLen+1))}"
            #line="${line:${fieldLen}}"
            #line="${line:1}"

            # Append to output line and prepare for next field.

            outLine="${outLine}${sep}${field}"; sep=","

            fieldLen=$(getNextFieldLen "${line}")
        done

        # Output built line.

        echo "${outLine}"
    done
}

And, on the off-chance you want to read directly from a file (though providing a file name that's empty or "-" will use standard input so you can probably just use the file-based function for everything):

function quotifyFile {
    file="$1"

    # Empty file or "-" means standard input, otherwise take input from real file.

    [[ ${#file} -eq 0 ]] && { quotifyStdIn; return; }
    [[ "${file}" = "-" ]] && { quotifyStdIn; return; }

    quotifyStdIn < "${file}"
}

And, finally, because every program that's not a "Hello, world" one deserves some form of test harness, this is what you can use to test the various capabilities:

(
    echo 'paxdiablo,was here'
    echo 'and,"then, strangely,",he,was,not'
    echo '50,"My name is ""Pax"", and yours is ""Bob""",42'
    echo '17,"""Love"" is grand",19'
) > harness.csv

echo "Before:"
sed "s/^/   /" harness.csv
echo "After:"
quotifyFile harness.csv | sed "s/^/   /"

rm -rf harness.csv

And, since a test harness is of little use unless you run the tests, here's the results of the first run:

Before:
   paxdiablo,was here
   and,"then, strangely,",he,was,not
   50,"My name is ""Pax"", and yours is ""Bob""",42
   17,"""Love"" is grand",19
After:
   "paxdiablo","was here"
   "and","then, strangely,","he","was","not"
   "50","My name is ""Pax"", and yours is ""Bob""","42"
   "17","""Love"" is grand","19"

Hopefully, that will be enough to get you going in the absence of being able to install packages. Of course, if one of the packages you can't install is bash itself, then you have problems that I can't help you with :-)

Upvotes: 5
