Reputation: 53
We are using oracle's sqlldr to load data files created by a third party. The data files vary in size and some are very large.
The fields of the data file are separated by "|"
Example: field1|field2|field3|field4|field5
field3
could be:
value1|value2|value3
I need to enclose field3
with quotes if it contains |.
I've made a shell script to do this, but it is a bit slow - about 16 minutes to process a file with 47,000,000 lines.
I was thinking of doing it in awk, but I'm not that familiar with the syntax and deadlines don't permit learning/developing/debugging.
Would it be significantly faster in awk?
Is there a simple way to do it?
Thanks
Upvotes: 1
Views: 153
Reputation: 103754
Given:
$ cat file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|value1|value2|field4|field5
field1|field2|value1|value2|value3|field4|field5
You can use this awk:
awk ' BEGIN{FS=OFS="|"} # sep fields on |
NF<=5{print; next} # if there are <=5, we are done with line
{s=$1 OFS $2 OFS "\"" # form first 2 fields + "
# now loop through the extra fields adding to string after quote:
for (i=3;i<=NF-5+3;i++) s=(i<NF-5+3) ? s $i OFS : s $i
s=s "\"" OFS $(NF-5+4) OFS $(NF) # finish the string
print s # then print it
}' file
Prints:
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5
And yes -- this would be significantly faster with awk vs just the shell.
If you want a one liner, I would use perl this way:
perl -F'[|]' -lpE 's/^([^|]+\|[^|]+\|)(.*)(\|[^|]+\|[^|]+)$/\1"\2"\3/ if scalar @F!=5' file
Same output.
Upvotes: 3
Reputation: 8711
Another awk
$ cat philb2
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|value1|value2|field4|field5
field1|field2|value1|value2|value3|field4|field5
$ awk -F"|" ' NF==5{print; next} {OFS="|"; v1=$(NF);v2=$(NF-1);f1=$1;f2=$2;$1=$2=""; m=substr($0,3,length($0)-length(v1 v2)-4); print f1,f2,"\"" m "\"",v2,v1; } ' philb2
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5
$
Upvotes: 2
Reputation: 203229
With any awk in any shell on every Unix box:
$ awk -F'|' 'NF>5{sub(/^([^|]*\|){2}/,"&\""); sub(/(\|[^|]*){2}$/,"\"&")} 1' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5
Alternatives:
With a sed that has -E
to enable EREs (e.g. GNU and BSD/OSX sed):
$ sed -E 's/^(([^|]*\|){2})(.*\|.*)((\|[^|]*){2})/\1"\3"\4/' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5
With any POSIX sed:
$ sed 's/^\(\([^|]*|\)\{2\}\)\(.*|.*\)\(\(|[^|]*\)\{2\}\)/\1"\3"\4/' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5
With GNU awk for gensub():
$ awk '{$0=gensub(/^(([^|]*\|){2})(.*\|.*)((\|[^|]*){2})$/,"\\1\"\\3\"\\4",1)} 1' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5
With GNU awk for the 3rd arg to match():
$ awk 'match($0,/^(([^|]*\|){2})(.*\|.*)((\|[^|]*){2})$/,a){$0=a[1] "\"" a[3] "\"" a[4]} 1' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5
The above were run against the sample input file created by @dawg:
$ cat file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|value1|value2|field4|field5
field1|field2|value1|value2|value3|field4|field5
Upvotes: 3
Reputation: 12347
Use this Perl one-liner using splice
, which runs for ~1.5 min on 47e6 lines:
perl -F'[|]' -lane '@first = splice @F, 0, 2; @last = splice @F, -2, 2; print join "|", @first, ( @F > 1 ? q{"} . ( join "|", @F ) . q{"} : @F ), @last;' in_file
Input:
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|val1|val2|val3|field4|field5
Output:
field1|field2|"field3"|field4|field5
field1|field2|""|field4|field5
field1|field2|"val1|val2|val3"|field4|field5
The Perl one-liner uses these command line flags:
-e
: Tells Perl to look for code in-line, instead of in a file.
-n
: Loop over the input one line at a time, assigning it to $_
by default.
-l
: Strip the input line separator ("\n"
on *NIX by default) before executing the code in-line, and append it when printing.
-a
: Split $_
into array @F
on whitespace or on the regex specified in -F
option.
-F'/[|]/'
: Split into @F
on |
, rather than on whitespace.
SEE ALSO:
perldoc perlrun
: how to execute the Perl interpreter: command line switches
Benchmark:
# Make input file with 47e6 lines:
perl -le '
$s =
"field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|val1|val2|val3|field4|field5";
print $s for 1..15_666_667;
' > in_file.txt
wc -l in_file.txt
# 47_000_001
time perl -F'[|]' -lane '@first = splice @F, 0, 2; @last = splice @F, -2, 2; print join "|", @first, ( @F > 1 ? q{"} . ( join "|", @F ) . q{"} : @F ), @last;' in_file.txt > out_file.txt
Runs for a median of 1 min 31 sec. Measured 3 times using perl 5, version 30, subversion 3 (v5.30.3) built for darwin-thread-multi-2level, running on MacBook Pro, macOS 10.14.6.
Upvotes: 2