Reputation: 33
I have a 40 MB single-line file with no fixed width or delimiting character. However, each record starts with a P, followed by either a P or an S, and then a number. So a record might start with:
PP5, PS5, PP0, etc.
What's the best way to separate the records?
Upvotes: 0
Views: 60
Reputation: 2805
Each record begins with P, then P or S, then a number. Where one record begins is where the previous one ends, so why not use a fixed RS instead of a regex one? Maybe:
{mawk/mawk2/gawk} 'BEGIN { FS = "^$" ; RS = "\nP" ;
} FNR==1 { sub(/^P/, "") } { print "P" $0 } '
Let RS take the P off, and pad it back in when printing. Handle the first-row case with either print plus next or a single sub(). I prefer a condition that runs only once, for FNR==1, over the opposite, which requires testing FNR > 1.
Yes, the last line technically won't get split by RS. That's one of awk's known weaknesses: the final line prints with ORS either way, with or without an RS at EOF.
I wrote it this way to allow for variants that don't have RT (basically every awk other than gawk). RT makes life easy.
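For awks without RT, a different route (not the RS trick above) is to walk the line with match() and cut at each record start. The following is only a sketch under that assumption; split_records is a hypothetical name, and the repeated substr() calls copy the remaining text on every record, so it may be slow on a 40 MB line.

```shell
# Hypothetical helper: split_records is a name chosen for illustration.
# Uses only POSIX awk features (match, RSTART, RLENGTH, substr).
split_records() {
  awk '{
    s = $0      # text not yet consumed
    rec = ""    # record currently being assembled
    while (match(s, /P[PS][0-9]/)) {
      # Everything before the match belongs to the record in progress.
      rec = rec substr(s, 1, RSTART - 1)
      if (rec != "") print rec
      # The matched marker starts the next record.
      rec = substr(s, RSTART, RLENGTH)
      s = substr(s, RSTART + RLENGTH)
    }
    # Whatever is left completes the final record.
    rec = rec s
    if (rec != "") print rec
  }'
}

printf '%s\n' 'PP5xxxPS5yyyyPP0zzz' | split_records
```

Unlike the RS approach, the final record needs no special handling here, because the whole line is read as one record and split by hand.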
Upvotes: 0
Reputation: 203368
Borrowing @karakfa's sample input, this might be what you want (using GNU awk for multi-char RS and RT):
$ echo 'PP5xxxPS5yyyyPP0zzz' | awk -v RS='P[PS][0-9]|\n' 'NR>1{print pRT $0} {pRT=RT}'
PP5xxx
PS5yyyy
PP0zzz
The differences between that gawk solution and the sed solution @karakfa suggested are: the sed solution works as-is in any sed that accepts \n in the replacement text to mean "newline", and is easily modified to use an escaped literal newline in others, while the above requires GNU awk.
Upvotes: 1
Reputation: 67467
$ echo PP5xxxPS5yyyyPP0zzz | awk -F'P[PS][0-9]' -v OFS='\n' '{$1=$1}1'
xxx
yyyy
zzz
Since the input starts with the delimiter, there is a blank first line, which can be eliminated if that matters.
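One way to drop that blank first line, sketched here as an assumption about what "eliminated" could look like, is to skip the empty first field and print the remaining fields one per line. The function name split_fields is hypothetical.

```shell
# Hypothetical helper: split_fields is a name chosen for illustration.
# Field 1 is the (empty) text before the first delimiter, so start at 2.
split_fields() {
  awk -F'P[PS][0-9]' '{ for (i = 2; i <= NF; i++) print $i }'
}

printf '%s\n' 'PP5xxxPS5yyyyPP0zzz' | split_fields
```

This assumes the input really does begin with a delimiter; any text before the first one would be silently discarded.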
If you want to preserve the delimiters, it is perhaps easier with sed:
$ echo PP5xxxPS5yyyyPP0zzz | sed 's/P[PS][0-9]/\n&/g'
PP5xxx
PS5yyyy
PP0zzz
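If your sed does not treat \n in the replacement as a newline, a backslash followed by a literal newline is the portable spelling. The sketch below uses that form and also strips the newline inserted before the very first record; split_keep is a hypothetical name, and like the one-liner above it still loads the whole line into memory.

```shell
# Hypothetical helper: split_keep is a name chosen for illustration.
# First expression: put a literal newline before every record marker.
# Second expression: remove the newline that lands at the very start.
split_keep() {
  sed -e 's/P[PS][0-9]/\
&/g' -e 's/^\n//'
}

printf '%s\n' 'PP5xxxPS5yyyyPP0zzz' | split_keep
```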
Upvotes: 3