Daniel 61
Daniel 61

Reputation: 33

awk - how to replace semicolon in string in csv file?

I need to manage smtp logfile handling in my company.

These logfiles need to be imported to MSSQL, so it is my job to provide this data.

I got strange undelivery message with a ";" in the string, I need to replace this with a comma.

So what I got:

Sender;Recipient;Operation;Answer;Error;Servername
[email protected];[email protected];RCPT TO;450;+4.2.0+<[email protected]>:+Recipient+address+rejected:+Policy+restrictions;+try+later;M0641

Mention the ";" in the Answer field after "restrictions", dunno why the mail server sends semicolons, maybe to annoy me :P

I tried following with awk after I did a lot of research:

awk 'BEGIN{FS=OFS=";"} {for (i=5;i<=NF;i++) gsub (";",",",$i)} 1' myfile.csv

This command actually works but it seems it does nothing with my file, the ";" in the error field remains. What I am missing here ?

Upvotes: 3

Views: 3274

Answers (4)

karakfa
karakfa

Reputation: 67467

I think your problem is replacing the unquotes delimiters in your logical 4th field in a five field wide input. Although this script is repetitious should be easier to understand

$ awk '{n=split($0,a,";"); 
        for(i=1; i<4; i++)   printf "%s;", a[i]; 
        for(i=4; i<n-1; i++) printf "%s,", a[i]; 
        printf "%s;%s\n", a[n-1], a[n]}' file 

A better way to write the same based on @Ed Morton's comments

$ awk -F';' '{for(i=1; i<NF-1; i++) printf "%s"(i<4?FS:","), $i; 
              print $(NF-1) FS $NF}' file

For the input

1;2;3;4a;4b;4c;5
1;2;3;4;5

it generates

1;2;3;4a,4b,4c;5
1;2;3;4;5

Upvotes: 1

James Brown
James Brown

Reputation: 37404

If your fifth ; should be removed, append $6 to $5 and advance accordingly. This could be done with for loop (there are examples in SO) but since the fault is so near the end, we'll just do this in a simpler way:

$ awk 'BEGIN {FS=OFS=";"} NR==1 {nf=NF} NF==(nf+1) {$5=$5 "," $6; $6=$7; NF=nf} 1' file

Explained:

BEGIN {FS=OFS=";"}   # set separator
NR==1 {nf=NF}        # get field count from the first record (6)
NF==(nf+1) {         # if record is one field longer:
    $5=$5 "," $6     # append $6 to $5, comma-separated
    $6=$7            # set $7 (NF) to $6 (nf)
    NF=nf            # reset NF
} 1                  # output

Testing: Running the program and sending the output to cut -d\; -f 5 outputs:

Error
+4.2.0+<[email protected]>:+Recipient+address+rejected:+Policy+restrictions,+try+later

Upvotes: 0

John1024
John1024

Reputation: 113834

Replacing the fifth and later ; with ,

$ awk -F\; '{for (i=1;i<=NF;i++) printf "%s%s",$i,(i==NF?ORS:(i<=4?";":","))}' myfile.csv 
Sender;Recipient;Operation;Answer;Error,Servername
[email protected];[email protected];RCPT TO;450;+4.2.0+<[email protected]>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641

How it works:

  • -F\;

    This sets the field separator for input to ;.

  • for (i=1;i<=NF;i++) printf "%s%s",$i,(i==NF?ORS:(i<=4?";":","))

    This loops over every field and prints the field followed by (a) ORS if we are on the last field, or (b) , if were are on field 5 or later, or (c) ; if we are on one of the first four fields.

Replacing all ; with ,

Try:

$ awk -F\; '{$1=$1} 1' OFS=, myfile.csv
Sender,Recipient,Operation,Answer,Error,Servername
[email protected],[email protected],RCPT TO,450,+4.2.0+<[email protected]>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641

How it works:

  • -F\;

    This sets the field separator on input to a semicolon.

  • $1=$1

    This causes awk to think the the line has been changed so that awk will update the output line to use the new field separator.

  • 1

    This tells awk to print the line.

  • OFS=,

    This sets the field separator on output to a comma.

Alternative #1

$ awk '{gsub(/;/, ",")} 1' myfile.csv
Sender,Recipient,Operation,Answer,Error,Servername
[email protected],[email protected],RCPT TO,450,+4.2.0+<[email protected]>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641

Alternative #2

$ sed 's/;/,/g'  myfile.csv
Sender,Recipient,Operation,Answer,Error,Servername
[email protected],[email protected],RCPT TO,450,+4.2.0+<[email protected]>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641

Upvotes: 5

Ed Morton
Ed Morton

Reputation: 203324

If the offending semi-colons only appear in your 5th field then you can do this using GNU awk for the 3rd arg to match():

$ awk 'match($0,/(([^;]+;){4})(.*)(;[^;]+$)/,a){gsub(/;/,",",a[3]); print a[1] a[3] a[4]}' file
[email protected];[email protected];RCPT TO;450;+4.2.0+<[email protected]>:+Recipient+address+rejected:+Policy+restrictions,+try+later;M0641

Upvotes: 0

Related Questions