philb
philb

Reputation: 53

Enclose fields field separators in quotes

We are using oracle's sqlldr to load data files created by a third party. The data files vary in size and some are very large.

The fields of the data file are separated by "|"

Example: field1|field2|field3|field4|field5

field3 could be:

I need to enclose field3 with quotes if it contains |.

I've made a shell script to do this, but it is a bit slow - about 16 minutes to process a file with 47,000,000 lines.

I was thinking of doing it in awk, but I'm not that familiar with the syntax and deadlines don't permit learning/developing/debugging.

Would it be significantly faster in awk?

Is there a simple way to do it?

Thanks

Upvotes: 1

Views: 153

Answers (4)

dawg
dawg

Reputation: 103754

Given:

$ cat file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|value1|value2|field4|field5
field1|field2|value1|value2|value3|field4|field5

You can use this awk:

awk  '  BEGIN{FS=OFS="|"}        # sep fields on |
        NF<=5{print; next}       # if there are <=5, we are done with line
        {s=$1 OFS $2 OFS "\""    # form first 2 fields + "
        # now loop through the extra fields adding to string after quote:
        for (i=3;i<=NF-5+3;i++) s=(i<NF-5+3) ? s $i OFS : s $i
        s=s "\"" OFS $(NF-5+4) OFS $(NF)   # finish the string
        print s                            # then print it
        }' file 

Prints:

field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5

And yes -- this would be significantly faster with awk vs just the shell.


If you want a one liner, I would use perl this way:

perl -F'[|]' -lpE  's/^([^|]+\|[^|]+\|)(.*)(\|[^|]+\|[^|]+)$/\1"\2"\3/ if scalar @F!=5' file

Same output.

Upvotes: 3

stack0114106
stack0114106

Reputation: 8711

Another awk

$ cat philb2
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|value1|value2|field4|field5
field1|field2|value1|value2|value3|field4|field5
$ awk -F"|" ' NF==5{print; next} {OFS="|"; v1=$(NF);v2=$(NF-1);f1=$1;f2=$2;$1=$2=""; m=substr($0,3,length($0)-length(v1 v2)-4); print f1,f2,"\"" m "\"",v2,v1; } ' philb2
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5
$

Upvotes: 2

Ed Morton
Ed Morton

Reputation: 203229

With any awk in any shell on every Unix box:

$ awk -F'|' 'NF>5{sub(/^([^|]*\|){2}/,"&\""); sub(/(\|[^|]*){2}$/,"\"&")} 1' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5

Alternatives:

With a sed that has -E to enable EREs (e.g. GNU and BSD/OSX sed):

$ sed -E 's/^(([^|]*\|){2})(.*\|.*)((\|[^|]*){2})/\1"\3"\4/' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5

With any POSIX sed:

$ sed 's/^\(\([^|]*|\)\{2\}\)\(.*|.*\)\(\(|[^|]*\)\{2\}\)/\1"\3"\4/' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5

With GNU awk for gensub():

$ awk '{$0=gensub(/^(([^|]*\|){2})(.*\|.*)((\|[^|]*){2})$/,"\\1\"\\3\"\\4",1)} 1' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5

With GNU awk for the 3rd arg to match():

$ awk 'match($0,/^(([^|]*\|){2})(.*\|.*)((\|[^|]*){2})$/,a){$0=a[1] "\"" a[3] "\"" a[4]} 1' file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|"value1|value2"|field4|field5
field1|field2|"value1|value2|value3"|field4|field5

The above were run against the sample input file created by @dawg:

$ cat file
field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|value1|value2|field4|field5
field1|field2|value1|value2|value3|field4|field5

Upvotes: 3

Timur Shtatland
Timur Shtatland

Reputation: 12347

Use this Perl one-liner using splice, which runs for ~1.5 min on 47e6 lines:

perl -F'[|]' -lane '@first = splice @F, 0, 2; @last = splice @F, -2, 2; print join "|", @first, ( @F > 1 ? q{"} . ( join "|", @F ) . q{"} : @F ), @last;' in_file

Input:

field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|val1|val2|val3|field4|field5

Output:

field1|field2|"field3"|field4|field5
field1|field2|""|field4|field5
field1|field2|"val1|val2|val3"|field4|field5

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/[|]/' : Split into @F on |, rather than on whitespace.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Benchmark:

# Make input file with 47e6 lines:

perl -le '
$s = 
"field1|field2|field3|field4|field5
field1|field2||field4|field5
field1|field2|val1|val2|val3|field4|field5";
print $s for 1..15_666_667;
' > in_file.txt

wc -l in_file.txt
# 47_000_001

time perl -F'[|]' -lane '@first = splice @F, 0, 2; @last = splice @F, -2, 2; print join "|", @first, ( @F > 1 ? q{"} . ( join "|", @F ) . q{"} : @F ), @last;' in_file.txt > out_file.txt

Runs for a median of 1 min 31 sec. Measured 3 times using perl 5, version 30, subversion 3 (v5.30.3) built for darwin-thread-multi-2level, running on MacBook Pro, macOS 10.14.6.

Upvotes: 2

Related Questions