Nir
Nir

Reputation: 2639

join different regex possibilities into one

Is it possible to make the following regex-es into one?

cat file.txt | \ 
sed 's/\tNULL\t/\t\\N\t/g' | \
sed 's/^NULL\t/\\N\t/g' | \
sed 's/\tNULL$/\t\\N/g' | \
sed 's/^NULL$/\\N/g'' 

Maybe another thing to add, this is going over billions of rows a month so performance is a consideration.

Solutions benchmark Thank you all for the suggestions, perl run the fastest for me. In case you wondered:

[/tmp]$ time cat /tmp/result_w_null.txt > /dev/null
real    0m0.045s
user    0m0.000s
sys     0m0.042s
[/tmp]$ time cat /tmp/result_w_null.txt | sed 's/\<NULL\>/\\N/g' > /dev/null
real    0m5.843s
user    0m2.472s
sys     0m3.852s
[/tmp]$ time cat /tmp/result_w_null.txt | sed 's/\tNULL\t/\t\\N\t/g' | sed 's/^NULL\t/\\N\t/g' | sed 's/\tNULL$/\t\\N/g' | sed 's/^NULL$/\\N/g' > /dev/null
real    0m7.078s
user    0m7.148s
sys     0m4.963s

#Suggestions:
[/tmp]$ time cat /tmp/result_w_null.txt | awk -F'\t' -v OFS='\t' '{for (i=1;i<=NF;i++) if ($i=="NULL") $i="\\N"}1' > /dev/null
real    0m20.196s
user    0m14.876s
sys     0m7.145s
[/tmp]$ time cat /tmp/result_w_null.txt | awk -v RS='(^|[\t\n])NULL(\tNULL)*([\t\n]|$)' '{ gsub(/NULL/, "\\N", RT); ORS=RT} 1' > /dev/null
real    0m10.611s
user    0m8.743s
sys     0m3.754s
[/tmp]$ time cat /tmp/result_w_null.txt | sed -E ':a; s/(\t|^)NULL(\t|$)/\1\\N\2/g; ta' > /dev/null
real    0m9.673s
user    0m5.723s
sys     0m5.678s
[/tmp]$ time cat /tmp/result_w_null.txt | perl -pe 's/(?:\t|^)\KNULL(?=\t|$)/\\N/g' > /dev/null
real    0m4.452s
user    0m3.237s
sys     0m2.288s

Upvotes: 3

Views: 93

Answers (3)

anubhava
anubhava

Reputation: 785551

Here is an alternate gnu-awk solution:

cat file

abc   NULL  foo
NULL  bar
xyz   NULL
pqr   mnop
NULL

Using gnu-awk with a custom RS:

awk -v RS='(^|[\t\n])NULL(\tNULL)*([\t\n]|$)' '{
gsub(/NULL/, "\\N", RT); ORS=RT} 1' file

abc   \N    foo
\N    bar
xyz   \N
pqr   mnop
\N

Upvotes: 2

glenn jackman
glenn jackman

Reputation: 247022

awk might be easier to comprehend for this usage:

awk '
  BEGIN {FS = OFS = "\t"}
  {
    for (i=1; i<=NF; i++)
      if ($i == "NULL")
        $i = "\\N"
    print
  }
' file.txt

Or, one-liner-ized

awk -F'\t' -v OFS='\t' '{for (i=1;i<=NF;i++) if ($i=="NULL") $i="\\N"}1' file.txt

Upvotes: 2

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627082

You can use

sed -E 's/(\t|^)NULL(\t|$)/\1\\N\2/g;'

If there can be consecutive matches

sed -E ':a; s/(\t|^)NULL(\t|$)/\1\\N\2/g; ta'

See an online demo.

The POSIX ERE regex matches

  • (\t|^) - Capturing group 1 (\1 in the replacement pattern): a tab or start of string
  • NULL - a literal string
  • (\t|$) - Capturing group 2 (\2 in the replacement pattern): a tab or end of string.

For consecutive matches, you need to match in a loop, by setting a label (:a) and then branching to it using ta. This is a way to work around the lack of lookahead support that would allow checking for the trailing tab without consuming it. In Perl, you would use

perl -pe 's/(?:\t|^)\KNULL(?=\t|$)/\\N/g'

where

  • (?:\t|^) - a non-capturing group matching a tab or start of string
  • \K - match reset operator that discards all text matched so far
  • NULL - a literal string
  • (?=\t|$) - a positive lookahead that requires a tab or end of string immediately to the right of the current location.

Upvotes: 2

Related Questions