spacegrass5150

Reputation: 33

Separate single line data into multiple lines via regex with awk?

I have a 40 MB file that consists of a single line, with no fixed-width fields and no delimiting character. Each record starts with a P, followed by either a P or an S, and then a number, so a record might start like:

PP5, PS5, PP0, etc.

What's the best way to split this into one record per line?

Upvotes: 0

Views: 60

Answers (3)

RARE Kpop Manifesto

Reputation: 2805

Each record begins with P, then P or S, then a number. Where one record begins is where the previous one ends, so why not use a fixed RS instead of a regex one? Maybe:

{mawk/mawk2/gawk} 'BEGIN { FS = "^$"; RS = "\nP" }
    FNR == 1 { sub(/^P/, "") }
    { print "P" $0 }'

Let RS strip the leading P, then add it back in the print. The first record needs special handling, either with print plus next or with a single sub(). I prefer a condition that runs only once, for FNR==1, over the opposite approach that requires testing FNR > 1 on every record.

Yes, the last line technically won't get split by RS. That's one of awk's known weaknesses: the final record is printed with the same ORS whether or not there is an RS at EOF.

I wrote it this way to support awk variants that don't have RT (basically everything other than gawk). RT makes life easy.
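For awks without RT, another option is to skip RS entirely and let gsub() insert the newlines. This is only a sketch, assuming the whole line fits in memory (40 MB normally would) and that every occurrence of P[PS][0-9] really is a record start; the function name split_records is made up for illustration.

```shell
# Hypothetical helper: split a single-line stream into one record per line.
# Uses only POSIX awk features (gsub, sub, & in the replacement), no RT needed.
split_records() {
    awk '{
        # Put a newline in front of every record header (P, then P or S, then a digit).
        gsub(/P[PS][0-9]/, "\n&")
        # The first header sits at the start of the line, so drop the
        # newline that was just inserted before it.
        sub(/^\n/, "")
        print
    }'
}

printf 'PP5xxxPS5yyyyPP0zzz\n' | split_records
```

Unlike the RS-based version this reads the entire line at once, so it trades memory for portability.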

Upvotes: 0

Ed Morton

Reputation: 203368

Borrowing @karakfa's sample input, this might be what you want (using GNU awk for multi-char RS and RT):

$ echo 'PP5xxxPS5yyyyPP0zzz' | awk -v RS='P[PS][0-9]|\n' 'NR>1{print pRT $0} {pRT=RT}'
PP5xxx
PS5yyyy
PP0zzz

The differences between that gawk solution and the sed solution @karakfa suggested are:

  1. The sed solution will print a blank line at the start of the output while the above won't.
  2. The sed solution will read the whole input line into memory at once while the above will only read one RS-separated block into memory at a time. That would only matter if your input was too huge to fit in memory all at once.
  3. The sed script is portable to any version of sed that allows \n in the replacement text to mean "newline" and is easily modified to use an escaped literal newline in others while the above requires GNU awk.
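For a sed that does not accept \n in the replacement text, the escaped-literal-newline form mentioned in point 3 could look like the sketch below. The function name split_with_sed is hypothetical, and the second line of the sed script must start in column 0, otherwise the indentation becomes part of the replacement.

```shell
# Hypothetical helper: same substitution as the sed one-liner, but with a
# backslash followed by a real newline instead of \n in the replacement.
split_with_sed() {
    sed 's/P[PS][0-9]/\
&/g'
}

printf 'PP5xxxPS5yyyyPP0zzz\n' | split_with_sed
```

As with the \n version, the output still begins with a blank line because the input starts with a record header.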

Upvotes: 1

karakfa

Reputation: 67467

$ echo PP5xxxPS5yyyyPP0zzz | awk -F'P[PS][0-9]' -v OFS='\n' '{$1=$1}1'

xxx
yyyy
zzz

Since the input starts with the delimiter, there is a blank first line, which can be eliminated if that matters.
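One way to drop that blank first line, as a rough sketch: print the fields in a loop and skip the first one when it is empty. The function name split_fields is made up here, and it assumes an awk that treats a multi-character FS as a regex, which POSIX awk does.

```shell
# Hypothetical helper: print each field on its own line, skipping the
# empty first field that appears when the input starts with a delimiter.
split_fields() {
    awk -F'P[PS][0-9]' '{
        for (i = 1; i <= NF; i++)
            if (i > 1 || $i != "")   # keep field 1 only if it has content
                print $i
    }'
}

printf 'PP5xxxPS5yyyyPP0zzz\n' | split_fields
```

If the input did not start with a delimiter, the leading text would still be printed as the first line.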

If you want to preserve the delimiters, it's perhaps easier with sed:

$ echo PP5xxxPS5yyyyPP0zzz | sed 's/P[PS][0-9]/\n&/g'

PP5xxx
PS5yyyy
PP0zzz

Upvotes: 3
