Reputation: 2639
Is it possible to make the following regex-es into one?
cat file.txt | \
sed 's/\tNULL\t/\t\\N\t/g' | \
sed 's/^NULL\t/\\N\t/g' | \
sed 's/\tNULL$/\t\\N/g' | \
sed 's/^NULL$/\\N/g''
Maybe another thing to add, this is going over billions of rows a month so performance is a consideration.
Solutions benchmark
Thank you all for the suggestions, perl
run the fastest for me. In case you wondered:
[/tmp]$ time cat /tmp/result_w_null.txt > /dev/null
real 0m0.045s
user 0m0.000s
sys 0m0.042s
[/tmp]$ time cat /tmp/result_w_null.txt | sed 's/\<NULL\>/\\N/g' > /dev/null
real 0m5.843s
user 0m2.472s
sys 0m3.852s
[/tmp]$ time cat /tmp/result_w_null.txt | sed 's/\tNULL\t/\t\\N\t/g' | sed 's/^NULL\t/\\N\t/g' | sed 's/\tNULL$/\t\\N/g' | sed 's/^NULL$/\\N/g' > /dev/null
real 0m7.078s
user 0m7.148s
sys 0m4.963s
#Suggestions:
[/tmp]$ time cat /tmp/result_w_null.txt | awk -F'\t' -v OFS='\t' '{for (i=1;i<=NF;i++) if ($i=="NULL") $i="\\N"}1' > /dev/null
real 0m20.196s
user 0m14.876s
sys 0m7.145s
[/tmp]$ time cat /tmp/result_w_null.txt | awk -v RS='(^|[\t\n])NULL(\tNULL)*([\t\n]|$)' '{ gsub(/NULL/, "\\N", RT); ORS=RT} 1' > /dev/null
real 0m10.611s
user 0m8.743s
sys 0m3.754s
[/tmp]$ time cat /tmp/result_w_null.txt | sed -E ':a; s/(\t|^)NULL(\t|$)/\1\\N\2/g; ta' > /dev/null
real 0m9.673s
user 0m5.723s
sys 0m5.678s
[/tmp]$ time cat /tmp/result_w_null.txt | perl -pe 's/(?:\t|^)\KNULL(?=\t|$)/\\N/g' > /dev/null
real 0m4.452s
user 0m3.237s
sys 0m2.288s
Upvotes: 3
Views: 93
Reputation: 785551
Here is an alternate gnu-awk solution:
cat file
abc NULL foo
NULL bar
xyz NULL
pqr mnop
NULL
Using gnu-awk
with a custom RS
:
awk -v RS='(^|[\t\n])NULL(\tNULL)*([\t\n]|$)' '{
gsub(/NULL/, "\\N", RT); ORS=RT} 1' file
abc \N foo
\N bar
xyz \N
pqr mnop
\N
Upvotes: 2
Reputation: 247022
awk
might be easier to comprehend for this usage:
awk '
BEGIN {FS = OFS = "\t"}
{
for (i=1; i<=NF; i++)
if ($i == "NULL")
$i = "\\N"
print
}
' file.txt
Or, one-liner-ized
awk -F'\t' -v OFS='\t' '{for (i=1;i<=NF;i++) if ($i=="NULL") $i="\\N"}1' file.txt
Upvotes: 2
Reputation: 627082
You can use
sed -E 's/(\t|^)NULL(\t|$)/\1\\N\2/g;'
If there can be consecutive matches
sed -E ':a; s/(\t|^)NULL(\t|$)/\1\\N\2/g; ta'
See an online demo.
The POSIX ERE regex matches
(\t|^)
- Capturing group 1 (\1
in the replacement pattern): a tab or start of stringNULL
- a literal string(\t|$)
- Capturing group 2 (\2
in the replacement pattern): a tab or end of string.For consecutive matches, you need to match in a loop, by setting a label (:a
) and then branching to it using ta
. This is a way to work around the lack of lookahead support that would allow checking for the trailing tab without consuming it. In Perl, you would use
perl -pe 's/(?:\t|^)\KNULL(?=\t|$)/\\N/g'
where
(?:\t|^)
- a non-capturing group matching a tab or start of string\K
- match reset operator that discards all text matched so farNULL
- a literal string(?=\t|$)
- a positive lookahead that requires a tab or end of string immediately to the right of the current location.Upvotes: 2