user2340612

Reputation: 10703

Awk/sed replace newlines

Intro:

I have been given a CSV file in which the field delimiter is the pipe character (i.e., |). This file has a predefined number of fields (say N). I can discover the value of N by reading the header of the CSV file, which we can assume to be correct.

Problem:

Some of the fields contain a newline character by mistake, which makes the line appear shorter than required (i.e., it has M fields, with M < N).
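For illustration, a quick check like the following lists the physical lines that come up short. It is only a sketch: list_short_lines and want are names made up for this example, and it assumes the header line is intact.

```shell
# list_short_lines is a hypothetical helper name used only for this example
list_short_lines() {
    awk -F'|' '
    # the header defines the expected number of fields (N)
    NR == 1 { want = NF; next }
    # report every physical line with fewer fields than the header
    NF < want { printf "line %d: %d of %d fields\n", NR, NF, want }
    ' "$1"
}
```

Calling it as list_short_lines file.csv prints one report line per short physical line of the file.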

What I need to create is a sh script (not bash) to fix those lines.

Attempted solution:

I wrote the following script to try to fix the file:

if [ $# -ne 1 ]
then
    echo "Usage: $0 <filename>" >&2
    exit 1
fi

# get first line
first_line=$(head -n 1 "$1")

# get number of fields
num_separators=$(echo "$first_line" | tr -d -c '|' | awk '{print length}')

cat "$1" | awk -v numFields=$(( num_separators + 1 )) -F '|' '
{
    totRecords = NF/numFields
    # loop over lines
    for (record=0; record < totRecords; record++) {
        output = ""
        # loop over fields
        for (i=0; i<numFields; i++) {
            j = (numFields*record)+i+1 
            # replace newline with question mark
            sub("\n", "?", $j)
            output = output (i > 0 ? "|" : "") $j 
        }
        print output
    }
}
'

However, the newline character is still present. How can I fix that problem?

Example of the CSV:

FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a
newline
Foo|Bar|Baz

Expected output:

FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz

* I don't care about the replacement: it could be a space, a question mark, or anything else except a newline or a pipe (which would create a new field).

Upvotes: 5

Views: 1374

Answers (2)

agc

Reputation: 8406

This assumes that only the last field may contain a newline, and at most one. Using tac and sed:

tac file.csv | sed -n '/|/!{h;n;x;H;x;s/\n/ * /p;b};p' | tac 

Output:

FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz

How it works: read the file backwards, because sed is easier to use without forward references. If a line has no '|' separator (/|/!), run the block of commands in curly braces {...}; otherwise just print the line with p. The block of commands:

  1. h; stores the delimiter-less line in sed's hold buffer.
  2. n; fetches the next line; since we're reading backwards, this is the line that should be appended to.
  3. x; exchanges the hold buffer and the pattern buffer.
  4. H; appends the pattern buffer to the hold buffer.
  5. x; exchanges the buffers again, so the pattern buffer now holds both lines.
  6. s/\n/ * /p; replaces the linefeed between them with " * ", leaving one longer line, and prints it.
  7. b; branches to the end of the script, which leaves the block and starts the next cycle.

Re-reverse the file with tac; done.
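Since the question asks for an sh script, here are the same sed commands laid out one per line inside a small function. This is a sketch rather than a hardened script: join_broken_lines is a name made up for this example, and it assumes tac (GNU coreutils, not POSIX) is installed.

```shell
# join_broken_lines is a hypothetical name; requires tac, which is not POSIX
join_broken_lines() {
    tac "$1" | sed -n '
    # a line without a pipe is treated as the tail of a broken last field
    /|/!{
        # save the tail, then fetch the line it belongs to
        h
        n
        # build "line + newline + tail" in the pattern buffer
        x
        H
        x
        # replace the embedded newline and print the joined line
        s/\n/ * /p
        # skip the plain print below
        b
    }
    p
    ' | tac
}
```

Calling it as join_broken_lines file.csv should print the same three lines as the one-liner above.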

Upvotes: 1

Ed Morton

Reputation: 203453

$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { reqdNF = NF; printf "%s", $0; next }
{ printf "%s%s", (NF < reqdNF ? " " : ORS), $0 }
END { print "" }

$ awk -f tst.awk file.csv
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a newline
Foo|Bar|Baz

If that's not what you want then edit your question to provide more truly representative sample input and associated output.
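As for why the original attempt changes nothing: awk splits its input into records at newlines by default, so no field ever contains "\n" and the sub("\n", "?", $j) call has nothing to replace. If newlines can also turn up in fields other than the last one, counting fields across physical lines is one way to extend the idea. A minimal sketch, assuming the first field never contains a newline (a pipe-less line that follows a complete record is treated as the tail of that record's last field); fix_csv, want, have, rec and n are names made up for this example:

```shell
# fix_csv is a hypothetical helper; it assumes the first field never
# contains a newline, otherwise a pipe-less fragment is ambiguous
fix_csv() {
    awk -F'|' '
    # the header defines the required number of fields
    NR == 1 { want = NF; print; next }
    {
        # an empty physical line still counts as one field fragment
        n = (NF ? NF : 1)
        if (have && (have < want || n == 1)) {
            # incomplete record, or a pipe-less tail: glue it on with a space;
            # joining two pieces merges one field, hence the "- 1"
            rec = rec " " $0
            have += n - 1
        } else {
            # the pending record is complete: print it and start a new one
            if (have) print rec
            rec = $0
            have = n
        }
    }
    # flush the last pending record
    END { if (have) print rec }
    ' "$1"
}
```

Usage would be something like fix_csv file.csv > fixed.csv. On the sample input it produces the same output as tst.awk above.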

Upvotes: 7
