user175084

Reputation: 4640

Use sed to format huge files

I have a test file that is 4.5GB and I am trying to format it.

First, I am trying to replace the tabs with commas and wrap each column field in double quotes (").

Second I am formatting a date field in the file with awk and sed.

Here is what I am using:

For formatting:

cat test_sample.csv | sed -e 's/"/""/g' | sed -e 's/\t/","/g' | sed -e 's/$/"/g' | sed -e 's/^/"/' > test_sample.csv

For Date:

awk 'BEGIN{FS=OFS="\",\""} NR>1{cmd = "date -d \"" $10 "\" \"+%Y-%m-%d\"";cmd | getline out; $10=out; close("uuidgen")} 1' test_sample.csv > _report.tmp && mv _report.tmp test_sample.csv

These commands run fine on small files, but on the large file they fail and clear all the data in the file.

Please can someone help me format this file?

Upvotes: 0

Views: 528

Answers (1)

Benjamin W.

Reputation: 52441

They also clear small files: the shell sets up the output redirection before any command runs, so the file is truncated before it can be read and stays empty.

Consider:

$ cat file.txt
A line of text
$ cat file.txt > file.txt
$ cat file.txt      # Empty!

To avoid that, you have to write to a temporary file and then replace the original, which the -i option in sed does for you. It optionally takes an extension for a backup copy:

sed -i.bak '...'

This addresses your file truncation problem.
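For tools that have no in-place option, the same idea can be done by hand. This is only a minimal sketch: demo_file.txt and the substitution are made up for illustration, and mktemp is assumed to be available.

```shell
# Sample input (demo_file.txt is a hypothetical name for this sketch).
printf 'A line of text\n' > demo_file.txt

# Write to a temporary file in the same directory, then replace the
# original only if the command succeeded.
tmp=$(mktemp ./demo_file.XXXXXX)
sed 's/text/TEXT/' demo_file.txt > "$tmp" && mv "$tmp" demo_file.txt

cat demo_file.txt
```

Because the output goes to a different file than the input, nothing is truncated while it is still being read, and the && keeps the original intact if sed fails.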

As for the rest:

  • Don't call sed many times like this:

    sed 's/pattern1/replacement1/' file | sed 's/pattern2/replacement2/' | ...
    

    This pushes the complete file through a separate sed process for each command, which makes the whole thing much slower. Use this instead:

    sed 's/pattern1/replacement1/;s/pattern2/replacement2/...'
    

    to process the file just once.

  • You don't have to use cat to pipe into sed: sed takes a filename as an argument and you can avoid this Useless Use of cat. Even more so if you combine the commands and avoid all pipes, see below.
  • Don't combine sed and awk. As a rule of thumb, if you use awk anywhere, you don't need sed.

A combined single-pass in-place sed command could look like this [1]:

sed -i 's/"/""/g;s/\t/","/g;s/$/"/;s/^/"/' test_sample.csv
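To see what this does to a single row, here is a small sketch on a throwaway file. The file name sed_demo.tsv and its contents are invented for the demo, and it assumes GNU sed (for \t in the pattern and for -i without an argument).

```shell
# One tab-separated sample line with an embedded double quote
# (sed_demo.tsv is a made-up name; \t in the pattern assumes GNU sed).
printf 'a\tb "c"\td\n' > sed_demo.tsv

# Same four substitutions as above: double embedded quotes, turn tabs
# into ",", then add the closing and opening quote of the line.
sed -i 's/"/""/g;s/\t/","/g;s/$/"/;s/^/"/' sed_demo.tsv

cat sed_demo.tsv
```

With GNU sed, the line comes out as "a","b ""c""","d". The embedded quotes have to be doubled first, otherwise the quotes added by the later substitutions would be doubled as well.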

And reducing everything to a single awk command (not as one-liner friendly any longer, but definitely faster than combining sed and awk):

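awk 'BEGIN { FS = "\t"; OFS = "," }
{
    # Reformat the date in column 10 (not on the header) before quoting
    if (NR > 1) {
        cmd = "date -d \"" $10 "\" \"+%Y-%m-%d\""
        if ((cmd | getline out) > 0)
            $10 = out
        close(cmd)    # close the command that was actually opened
    }
    # Escape embedded quotes, then wrap every field in quotes
    for (i = 1; i <= NF; ++i) {
        gsub(/"/, "\"\"", $i)
        $i = "\"" $i "\""
    }
    print
}' test_sample.csv > _report.tmp && mv _report.tmp test_sample.csv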
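On a 4.5GB file, starting one date process per row is likely to dominate the runtime. If the same dates occur many times, the results can be cached in an awk array. The following is only a sketch under assumptions: the sample file, its two-column layout, and the col variable are invented for the demo (your date is in column 10), and date -d assumes GNU date.

```shell
# Tiny tab-separated sample: header plus three rows, date in column 2
# (file names and column position are assumptions for this demo).
printf 'id\twhen\n1\t2020-01-02 10:30\n2\t2020-01-02 10:30\n3\t2021-03-04 08:00\n' > cache_demo.tsv

awk -v col=2 'BEGIN { FS = "\t"; OFS = "," }
NR > 1 {
    # Call date only the first time a given input value is seen
    if (!($col in cache)) {
        cmd = "date -d \"" $col "\" \"+%Y-%m-%d\""
        if ((cmd | getline out) > 0)
            cache[$col] = out
        else
            cache[$col] = $col    # keep the original value if date fails
        close(cmd)
    }
    $col = cache[$col]
}
{
    # Escape embedded quotes, then wrap every field in quotes
    for (i = 1; i <= NF; ++i) {
        gsub(/"/, "\"\"", $i)
        $i = "\"" $i "\""
    }
    print
}' cache_demo.tsv > cache_demo.csv

cat cache_demo.csv
```

Here date runs twice instead of three times. How much this helps on the real file depends on how many distinct values column 10 holds, and the cache array grows with that number.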

[1] BSD sed, as found in macOS, requires the backup extension as a separate argument, which may be empty: -i ''.

Upvotes: 5
