cho_joe

Reputation: 21

Can't seem to get correct regex for sed command

I have a CSV file in which I need to replace every occurrence of a double quote followed by a line feed with a string, e.g. "XXXX".

I've tried the following:

LC_CTYPE=C && LANG=C && sed 's/\"\n/XXXX/g' < input_file.csv > output_file.csv

and

LC_CTYPE=C && LANG=C && sed 's/\"\n\r/XXXX/g' < input_file.csv > output_file.csv

also tried

sed 's/\"\n\r/XXXX/g' < input_file.csv > output_file.csv

In each case, the command does not seem to recognize the combination "\n in the file.

It works if I look for just the double quote:

sed 's/\"/XXXX/g' < input_file.csv > output_file.csv

and if I look for just the line feed:

sed 's/\n\r/XXXX/g' < input_file.csv > output_file.csv

But no luck with the find-replace for the combined regex string

Any guidance would be most appreciated.

Adding simplified sample data

Sample input data (header row and two example records):

column1,column2
data,data<cr>
data,data"<cr>

Sample output:

column1,column2
data,data<cr>
data,dataXXXX

Update: Having some luck using perl commands in bash (MacOS) to get this done:

perl -pe 's/\"/XXXX/' input.csv > output1.csv

then

perl -pe 's/\n/YYYY/' output1.csv > output2.csv

This results in XXXXYYYY at the end of each record.

I'm sure there is an easier way, but this seems to be doing the trick on a test file I've been using. I'm trying it out there before I use it on the original 200K-line CSV file.

Upvotes: 1

Views: 145

Answers (2)

Ed Morton

Reputation: 204558

sed is for simple substitutions on individual lines, that is all, so this is not a job for sed.
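A small sketch of why the attempts in the question never match, using made-up two-line input: sed hands each line to the script with its trailing newline already removed, so a pattern containing \n has nothing to match unless further lines are appended explicitly.

```shell
# The pattern space holds one line without its newline, so "\n never matches
# and the input passes through unchanged.
unchanged=$(printf 'data,data"\nnext\n' | sed 's/"\n/XXXX/g')
printf '%s\n' "$unchanged"

# Anchoring on end-of-line ($) does reach the quote, but sed still writes the
# newline back on output, so the two lines are not joined.
anchored=$(printf 'data,data"\nnext\n' | sed 's/"$/XXXX/')
printf '%s\n' "$anchored"
```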

It sounds like this is what you want (uses GNU awk for multi-char RS):

$ awk -v RS='"\n' -v ORS='XXXX' '1' file
column1,column2
data,data
data,dataXXXX$

That final $ above is my prompt, demonstrating that both the " and the subsequent newline have been replaced.
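If GNU awk is not available, the same effect can probably be had without a multi-character RS. The following is a sketch under that assumption, not part of the command above: it keeps the default line-by-line reading and suppresses the newline only on lines that end in a double quote (optionally followed by a carriage return). The sample rows and the variable name portable_result are made up for illustration.

```shell
portable_result=$(printf 'column1,column2\ndata,data\ndata,data"\n' |
  awk '{
    # sub() returns 1 when the line ended with a quote (plus optional CR)
    if (sub(/"\r?$/, "XXXX")) printf "%s", $0   # print without the newline
    else print                                   # other lines pass through as-is
  }')
printf '%s\n' "$portable_result"
```

Because awk still processes one record at a time here, memory use stays flat regardless of file size.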

Upvotes: 3

Casimir et Hippolyte

Reputation: 89639

You can try something like this with GNU sed (\? and \| are GNU extensions, so the BSD sed on macOS may not accept it as written):

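sed ':a;/"\r\?$/{$!N;s/"\r\?\n\|"\r\?$/XXXX/;ba;}'

details:

:a                  # define the label "a"
/"\r\?$/            # condition: if the line ends with " then:
{
    $!N             # add the next line to the pattern space
                    # (skipped on the last line, where a bare N would make
                    #  sed exit before the substitution runs)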
    s/              # replace:
         "\r\?\n    # the " and the LF (or CRLF) 
      \|
         "\r\?$     # or a " at the end of the added line
                    # (this second alternative is only tested at the end
                    #  of the file)
     /XXXX/         # with XXXX
    ba              # go to label a
}

Upvotes: 1
