nikicc

Reputation: 638

Using sed, tr, ... to fix the structure of a file

I have a file whose lines should have the form

U:<text>\tD:<text>\tA:<text>\n

where <text> is text that should contain no tab or newline characters, \t is a tab, and \n is a newline. Unfortunately, some <text> fields contain a newline character, so the structure is broken. For example:

U:uuu     D:ddd     A:aaa
U:uuu     D:ddd     A:aaa
U:uu
    u    D:ddd    A:aaa
U:uuu     D:ddd     A:aaa

Here a newline character in the U field on the 3rd line pushed part of that record onto the 4th line. How can I fix the structure with tools like sed or tr? I want to delete the newline characters that are not at the end of a record.

For the example above, the fixed file should look like this:

U:uuu     D:ddd     A:aaa
U:uuu     D:ddd     A:aaa
U:uuu     D:ddd     A:aaa
U:uuu     D:ddd     A:aaa

Another important aspect of the solution is speed, since I have gigabytes of files to fix.

Upvotes: 1

Views: 87

Answers (1)

Jonathan Leffler

Reputation: 754550

Given the input data (saved in file data):

U:uuu     D:ddd     A:aaa1
U:uuu     D:ddd     A:aaa2
U:uu
    u     D:ddd     A:aaa3
U:uuu     D:ddd     A:aaa4
U:uuu     D:dd
              d     A:aaa5
U:uuu     D:ddd     A:aaa6

the sed script (saved in file sed.script):

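# Complete record (U:, D: and A: all present): print it as is.
/^U:.* D:.* A:.*/ { p; d; }
# U: and D: only: the A: part is on the next line, so join the two lines.
/^U:.* D:.*/ { N; s/\n *//; p; d; }
# U: only: D: and A: are on the next line, so join the two lines.
/^U:.*/ { N; s/\n *//; p; d; }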

can be run and produces the output shown:

$ sed -f sed.script data
U:uuu     D:ddd     A:aaa1
U:uuu     D:ddd     A:aaa2
U:uuu     D:ddd     A:aaa3
U:uuu     D:ddd     A:aaa4
U:uuu     D:ddd     A:aaa5
U:uuu     D:ddd     A:aaa6
$

The first line of the script looks for U:, D: and A: on a single line, assumes it is complete (and not a broken A: text field) and prints the line and deletes it (which skips the other actions in the script). The second line looks for U: and D: only; the A: is presumably on the next line. It appends the next line of input, removes the embedded newline and following spaces (if any), and then prints and deletes as before. The third line looks for U: only and assumes both D: and A: are on the next line. It appends the next line, removes the embedded newline and following spaces (if any), and then prints and deletes as before.
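Because the script takes any line with all three markers at face value, it can be worth checking afterwards whether every output line really has the expected shape. The following is a minimal Python sketch of such a check, not part of the sed solution. It assumes the fields are separated by real tab characters, as the question describes (the sample data here is shown with spaces), and the names RECORD and malformed_line_numbers are made up for illustration.

```python
import re

# Assumed record shape, taken from the question: U:<text>\tD:<text>\tA:<text>,
# where <text> contains no tab characters.
RECORD = re.compile(r"U:[^\t]*\tD:[^\t]*\tA:[^\t]*")


def malformed_line_numbers(lines):
    """Return the 1-based numbers of lines that are not complete records."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # Drop the trailing newline before matching the whole line.
        if RECORD.fullmatch(line.rstrip("\n")) is None:
            bad.append(number)
    return bad
```

An empty result suggests every line is a complete record; a non-empty one points at records that were split in a way the three rules do not cover.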

Extending it to handle breaks in the A: text field will be non-trivial. It would also be non-trivial to extend it to handle:

U:uu
    u     D:dd
              d     A:aaa7

Neither is formally impossible (especially if you choose to use Perl or Python instead of sed), but neither is completely simple. The double split is the simpler of the two to handle: inside the third line of the script, you'd have a second set of conditional actions based on whether the A: is found or not, and so on.

Handling multiple splits for a single field:

U:u
   u
    u
           D:d
              d
               d
                      A:aaa

would also be tricky — probably doable, even in sed, but tricky.
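For the Python route mentioned above, a minimal sketch might look like the following. It swaps in a different rule from the sed script: a physical line starts a new record only if it begins with U:, and every other line is glued onto the record before it. That is an assumption about the data (no broken-off fragment may itself begin with U:), and the function name join_broken_records is made up for illustration. Like the sed script's s/\n *//, it drops leading spaces on continuation lines, but it leaves tabs alone so that real tab separators survive.

```python
import sys


def join_broken_records(lines):
    """Yield one complete record per logical line, without trailing newline.

    Assumption (not from the sed script): only lines starting with "U:"
    begin a new record; anything else continues the previous one.
    """
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("U:"):
            if current is not None:
                yield current
            current = line
        elif current is None:
            # Stray fragment before the first record: start a buffer with it.
            current = line.lstrip(" ")
        else:
            # Remove the embedded newline and any leading spaces, keep tabs.
            current += line.lstrip(" ")
    if current is not None:
        yield current


if __name__ == "__main__":
    # Stream stdin to stdout so large files are never held in memory.
    for record in join_broken_records(sys.stdin):
        sys.stdout.write(record + "\n")
```

Saved under a hypothetical name such as fix_records.py, it could be run as python3 fix_records.py < data > data.fixed. It reads one line at a time, so memory use stays flat on large files, though it will likely be slower than sed.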

Upvotes: 2
