Use sed to format huge files

Question

I have a test file that is 4.5GB and I am trying to format it.

First, I am trying to replace the tabs with "," and also enclose each column field in double quotes (").

Second I am formatting a date field in the file with awk and sed.

Here is what I am using:

For formatting:

cat test_sample.csv | sed -e 's/"/""/g' | sed -e 's/	/","/g' | sed -e 's/$/"/g' | sed -e 's/^/"/' > test_sample.csv

For Date:

awk 'BEGIN{FS=OFS="\",\""} NR>1{cmd = "date -d \"" $10 "\" \"+%Y-%m-%d\"";cmd | getline out; $10=out; close("uuidgen")} 1' test_sample.csv > _report.tmp && mv _report.tmp test_sample.csv

These commands run fine on small files, but on the large file they fail and clear all the data in the file.

Please can someone help me format this file?

Benjamin W. · Accepted Answer

These commands clear small files as well: the shell sets up the redirection before it runs the command, so the file is truncated before anything reads it, and it stays empty.

Consider:

$ cat file.txt
A line of text
$ cat file.txt > file.txt
$ cat file.txt      # Empty!

To avoid that, you have to write to a temporary file and then replace the original, which the -i option in sed does for you. It optionally takes an extension, which is used to keep a backup copy of the original file:

sed -i.bak '...'

This addresses your file truncation problem.
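For tools that have no in-place option, the same temporary-file pattern can be written by hand. This is a minimal sketch, assuming `mktemp` is available; the file name `demo.txt`, its contents, and the substitution are made up for illustration:

```shell
#!/bin/sh
# Hypothetical demo file; not part of the original question.
workdir=$(mktemp -d)
printf 'A line of text\n' > "$workdir/demo.txt"

# Write to a temporary file first, then replace the original only if
# the command succeeded. The input is never truncated while it is
# still being read.
tmp=$(mktemp "$workdir/demo.XXXXXX")
sed 's/text/data/' "$workdir/demo.txt" > "$tmp" && mv "$tmp" "$workdir/demo.txt"

result=$(cat "$workdir/demo.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Because of the `&&`, the original file is left alone if the command fails.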

As for the rest:

Don't call sed many times like this:
```
sed 's/pattern1/replacement1/' file | sed 's/pattern2/replacement2/' | ...
```
This starts a separate sed process for every command and pushes the complete file through each of them, making the whole thing much slower. Use this instead:
```
sed 's/pattern1/replacement1/;s/pattern2/replacement2/...'
```
to process the file just once.

You don't have to use cat to pipe into sed: sed takes a filename as an argument and you can avoid this Useless Use of cat. Even more so if you combine the commands and avoid all pipes, see below.

Don't combine sed and awk. As a rule of thumb, if you use awk anywhere, you don't need sed.
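The advice about combining sed commands can be checked on a small sample. The sketch below runs the four substitutions from the question once as a pipeline and once as a single sed call, and both should give the same line. The sample row and the variable names are made up for illustration, and the commands are passed with separate `-e` options, which is equivalent to joining them with `;`:

```shell
#!/bin/sh
# Hypothetical two-column, tab-separated sample row.
tab=$(printf '\t')
sample=$(printf 'say "hi"\tworld')

# Four sed processes, each reading the whole stream.
chained=$(printf '%s\n' "$sample" |
    sed 's/"/""/g' | sed "s/$tab/\",\"/g" | sed 's/$/"/' | sed 's/^/"/')

# One sed process applying the same four commands in the same order.
combined=$(printf '%s\n' "$sample" |
    sed -e 's/"/""/g' -e "s/$tab/\",\"/g" -e 's/$/"/' -e 's/^/"/')

printf '%s\n%s\n' "$chained" "$combined"
```

Both lines printed should read `"say ""hi""","world"`.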

A combined single-pass in-place sed command could look like this¹:

sed -i 's/"/""/g;s/	/","/g;s/$/"/;s/^/"/' test_sample.csv

And reducing everything to a single awk command (not as one-liner friendly any longer, but definitely faster than combining sed and awk):

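awk 'BEGIN { FS = "\t"; OFS = "," }
NR > 1 {                                # convert the date, skipping the header
    cmd = "date -d \"" $10 "\" \"+%Y-%m-%d\""
    if ((cmd | getline out) > 0)
        $10 = out
    close(cmd)                          # close the pipe that was actually opened
}
{
    for (i = 1; i <= NF; ++i) {
        gsub(/"/, "\"\"", $i)           # escape embedded double quotes
        $i = "\"" $i "\""               # wrap the field in double quotes
    }
    print
}' test_sample.csv > _report.tmp && mv _report.tmp test_sample.csv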
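On a file this size, starting one `date` process per line is likely to dominate the run time. One possible variation, which is not part of the answer above, is to remember each converted value in an awk array so that `date` runs only once per distinct input value. This sketch assumes GNU `date` (for `-d`); the sample data, the column number passed as `col`, and the names `seen` and `formatted` are illustrative:

```shell
#!/bin/sh
# Hypothetical sample: a header plus three rows, with the date in
# column 2 (the question uses column 10).
workdir=$(mktemp -d)
printf 'id\tcreated\n1\t2021-03-04 10:20:30\n2\t2021-03-04 10:20:30\n3\t2022-12-31 23:59:59\n' \
    > "$workdir/in.tsv"

out=$(awk -v col=2 'BEGIN { FS = "\t"; OFS = "," }
NR > 1 {
    if (!($col in seen)) {              # first time this value appears
        cmd = "date -d \"" $col "\" \"+%Y-%m-%d\""
        if ((cmd | getline formatted) > 0)
            seen[$col] = formatted
        else
            seen[$col] = $col           # keep unparseable values as they are
        close(cmd)
    }
    $col = seen[$col]                   # reuse the cached conversion
}
{
    for (i = 1; i <= NF; ++i) {
        gsub(/"/, "\"\"", $i)           # escape embedded double quotes
        $i = "\"" $i "\""               # wrap the field in double quotes
    }
    print
}' "$workdir/in.tsv")

printf '%s\n' "$out"
rm -rf "$workdir"
```

The cache grows with the number of distinct date values, so this only pays off when many rows share the same value.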

¹ BSD sed as found in macOS requires an explicit (possibly empty) extension as a separate argument: -i ''.
