Reputation: 4640
I have a test file that is 4.5GB and I am trying to format it.
First I am trying to replace the tabs with ',' and also enclose each column field in double quotes (").
Second I am formatting a date field in the file with awk and sed.
Here is what I am using:
For formatting:
cat test_sample.csv | sed -e 's/"/""/g' | sed -e 's/\t/","/g' | sed -e 's/$/"/g' | sed -e 's/^/"/' > test_sample.csv
For Date:
awk 'BEGIN{FS=OFS="\",\""} NR>1{cmd = "date -d \"" $10 "\" \"+%Y-%m-%d\"";cmd | getline out; $10=out; close("uuidgen")} 1' test_sample.csv > _report.tmp && mv _report.tmp test_sample.csv
These commands run fine for small files, but on the large file they fail and clear all the data in the file.
Please can someone help me format this file?
Upvotes: 0
Views: 528
Reputation: 52441
They also clear small files because redirection happens first, so the file gets truncated and stays empty.
Consider:
$ cat file.txt
A line of text
$ cat file.txt > file.txt
$ cat file.txt # Empty!
To avoid that, you have to write to a temporary file, which the -i option in sed does for you. It optionally takes an extension, in which case the original is kept as a backup under that extension:
sed -i.bak '...'
This addresses your file truncation problem.
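As a minimal sketch of the temporary-file approach done by hand (the scratch directory and the file names demo.tsv and demo.tmp are made up for illustration, and \t in the pattern assumes GNU sed):

```shell
# Work in a scratch directory so no real data is touched.
dir=$(mktemp -d)
printf 'a\tb\n' > "$dir/demo.tsv"

# Unsafe (left commented out): the shell truncates demo.tsv before sed reads it.
# sed 's/\t/,/g' "$dir/demo.tsv" > "$dir/demo.tsv"

# Safe: write to a temporary file and replace the original only if sed succeeded.
sed 's/\t/,/g' "$dir/demo.tsv" > "$dir/demo.tmp" && mv "$dir/demo.tmp" "$dir/demo.tsv"

cat "$dir/demo.tsv"
```

Because of the &&, the original file is only overwritten when sed exits successfully.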
As for the rest:
Don't call sed many times like this:
sed 's/pattern1/replacement1/' file | sed 's/pattern2/replacement2/' | ...
This starts a separate sed process for each substitution and pipes the complete file through every one of them, making the process much slower. Use this instead:
sed 's/pattern1/replacement1/;s/pattern2/replacement2/...'
to process the file just once.
You also don't need cat to pipe into sed: sed takes a filename as an argument, so you can avoid this Useless Use of cat. Even more so if you combine the commands and avoid all pipes, see below.
A combined single-pass in-place sed command could look like this [1]:
sed -i 's/"/""/g;s/\t/","/g;s/$/"/;s/^/"/' test_sample.csv
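To check that chaining and combining really do the same thing, here is a small sketch on a made-up two-column sample (the file sample.tsv and its contents are hypothetical; \t assumes GNU sed). It runs without -i so the sample is left unchanged:

```shell
# Hypothetical sample: tab-separated, one field containing double quotes.
dir=$(mktemp -d)
printf 'id\tname\n1\tsay "hi"\n' > "$dir/sample.tsv"

# Four sed processes, one per substitution.
chained=$(sed 's/"/""/g' "$dir/sample.tsv" | sed 's/\t/","/g' | sed 's/$/"/' | sed 's/^/"/')

# One sed process applying the same substitutions in the same order.
single=$(sed 's/"/""/g;s/\t/","/g;s/$/"/;s/^/"/' "$dir/sample.tsv")

[ "$chained" = "$single" ] && echo "identical output"
printf '%s\n' "$single"
```

The second data line comes out as "1","say ""hi""", with the embedded quotes doubled and every field wrapped in quotes.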
And reducing everything to a single awk command (not as one-liner friendly any longer, but definitely faster than combining sed and awk):
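awk 'BEGIN { FS = "\t"; OFS = "," }
{
    gsub(/"/, "\"\"")                  # escape embedded quotes
    if (NR > 1) {                      # leave the header date column alone
        cmd = "date -d \"" $10 "\" \"+%Y-%m-%d\""
        if ((cmd | getline out) > 0)
            $10 = out
        close(cmd)                     # close the command that was actually opened
    }
    for (i = 1; i <= NF; ++i)          # wrap every field in double quotes
        $i = "\"" $i "\""
    print
}' test_sample.csv > _report.tmp && mv _report.tmp test_sample.csv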
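On a 4.5GB file, starting one date process per line is likely to dominate the run time. If the same date strings recur, a sketch like the following caches each conversion in an awk array so that date runs only once per distinct value. This is an assumption-laden illustration, not a drop-in replacement: the sample file, its date format, and the use of column 2 instead of column 10 are all made up, and it relies on date -d as provided by GNU date.

```shell
# Hypothetical sample: header plus three rows, two of them sharing a timestamp.
dir=$(mktemp -d)
printf 'id\twhen\n1\t2020-01-02 10:30:00\n2\t2020-01-02 10:30:00\n3\t2021-03-04 23:59:59\n' > "$dir/sample.tsv"

result=$(awk 'BEGIN { FS = "\t"; OFS = "," }
NR > 1 {
    if (!($2 in cache)) {              # first time this date string is seen
        cmd = "date -d \"" $2 "\" +%Y-%m-%d"
        if ((cmd | getline out) > 0)
            cache[$2] = out
        else
            cache[$2] = $2             # keep the original value if date fails
        close(cmd)
    }
    $2 = cache[$2]
}
{ $1 = $1; print }' "$dir/sample.tsv") # $1 = $1 rebuilds the line with OFS

printf '%s\n' "$result"
```

Here date is started twice for three data rows. How much this helps on real data depends entirely on how often the date values repeat.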
[1] BSD sed, as found in Mac OS, requires the backup extension as a separate argument, which may be empty: -i ''
Upvotes: 5