Reputation: 20667
I have a file that uses the pipe character as a separator since I assumed that most of my clients wouldn't use the pipe character in their data. Apparently I was wrong, but I compensated for this by specifying that any fields using the pipe character need to be wrapped in double quotes. Unfortunately, they haven't done this, but I can't have them re-export their file just yet because I need their old data, so I need to manually update their old data file to properly quote the field that has pipe characters in it.
Some of the lines are correct with 16 separators and thus 16 fields whilst some of the lines have 18 separators for only 16 fields. I need to wrap the 10th field with quotation marks only if it has 18 separators instead of 16. I have a Linux box at my disposal with awk, sed, grep, etc. and would love to have a way to script this so I don't have to do it manually.
A simplified example input/output with only 5 and 7 separators (quoting the 4th field) would be:
# Input
Field 1|Field 2|Field 3|Field 4|Field 5|Field 6
Field 1|Field 2|Field 3|Field | with | pipes|Field 5|Field 6
# Output
Field 1|Field 2|Field 3|Field 4|Field 5|Field 6
Field 1|Field 2|Field 3|"Field | with | pipes"|Field 5|Field 6
# Optional output if it is easier
Field 1|Field 2|Field 3|"Field 4"|Field 5|Field 6
Field 1|Field 2|Field 3|"Field | with | pipes"|Field 5|Field 6
Any help would be enormously appreciated!
Upvotes: 1
Views: 266
Reputation: 342363
KISS. When you are working with distinct fields and field separators, use a tool like awk which is totally meant for the job.
$ cat file
Field 1|Field 2|Field 3|Field 4|Field 5|Field 6
Field 1|Field 2|Field 3|Field | with | pipes|Field 5|Field 6
$ awk -F"|" 'NF>6{$4="\042"$4 ; $(NF-2)=$(NF-2)"\042";}1' OFS="|" file
Field 1|Field 2|Field 3|Field 4|Field 5|Field 6
Field 1|Field 2|Field 3|"Field | with | pipes"|Field 5|Field 6
the above says, if the number of fields are more than 6 (ie NF>6), then add a double quote (\042) to the 4th field, as well as the last 2nd field. (or change it accordingly depending on your data. )
No need to use complicated regular expression.
Upvotes: 1
Reputation: 360055
For your sample data:
sed -i '/\([^|]*|\)\{7\}/{s/\([^|]*|\)/"\1/4;s/\(|[^|]*\)/"\1/6}' inputfile
For your real data:
sed -i '/\([^|]*|\)\{17\}/{s/\([^|]*|\)/"\1/14;s/\(|[^|]*\)/"\1/16}' inputfile
Edit:
(I added a pair of missing braces in each example so the second s
command (actually both) only operate when the address matches. I also removed the -n
and the p
. Removing the p
eliminates the duplication. Sorry for the errors.)
The part before the s
command is called an "address". It selects only the lines that have 7 (or 17) pipe characters, thus excluding the s
command from operating on lines with a different number of pipe characters.
//
- The delimiters for the address\(\)
- grouping parentheses (escaped)[^|]*
- zero or more (*
) non-pipe (^|
) characters ([]
- character list delimiters)|
- and the pipe character we're interested in\{7\}
- repeat the group seven times {command; command}
- these braces delimit a block of commands that will be executed when the address is matched - together the address and the braces act like an if
statement and its associated blockSo that address matches lines that have seven groups of zero or more non-pipe characters that are each followed by a pipe character.
Then the first s
command says to replace the 4th (or 14th) pipe character and its preceding non-pipe characters with a quotation mark followed by the matched characters.
The semicolon is a command separator. Some versions of sed
require that, instead of semicolons, you use the `sed -e 'command' -e 'command' form for multicommand one-liner scripts.
By the way, the s
is a command and not part of the regex. The part between the initial pair of slashes in the commands above and the part between the initial pair of slashes in an s///
command are regexes.
Please let me know if you have additional questions.
The second s
command looks for the 6th (or 16th) pipe character and the zero or more non-pipes that follow it and replaces that with itself (the matched characters) and a quotation mark.
Upvotes: 1