Reputation: 8711
I have a pipe separated file like below, where the double quotes are not balanced within fields, I need to fix them
name |age|role |experience|
John |25 |Developer |2.56 |
Scot"x|30 |Tester |5.2 |
Jim |28 |DBA |3.0 |
Mike" |35 |"Consultant"|10.0 |
Daniel|26 |Developer |3.2 |
Paul" |29 |Tester " |3.6 |
Peter |30 |Developer |6.5 |
Required output:
name |age|role |experience|
John |25 |Developer |2.56 |
Scot x|30 |Tester |5.2 |
Jim |28 |DBA |3.0 |
Mike |35 |"Consultant"|10.0 |
Daniel|26 |Developer |3.2 |
Paul |29 |Tester |3.6 |
Peter |30 |Developer |6.5 |
I tried below command, but it is not giving expected results..
perl -ne ' { while(/\x22/g) {$c++} ; if($c%2!=0) { s/\x22//g ; print} else { print } ; $c=0 } ' file
Upvotes: 1
Views: 1215
Reputation: 23677
Assuming you only have one double quote in problematic fields and no |
character inside fields:
$ perl -lpe 's/(^|\|)[^"|]*\K"(?=[^|"]*(\||$))/ /g' ip.txt
name |age|role |experience|
John |25 |Developer |2.56 |
Scot x|30 |Tester |5.2 |
Jim |28 |DBA |3.0 |
Mike |35 |"Consultant"|10.0 |
Daniel|26 |Developer |3.2 |
Paul |29 |Tester |3.6 |
Peter |30 |Developer |6.5 |
(^|\|)
beginning of line or field separator |
[^"|]*
zero or more of non-double quote and non |
characters\K
so that the characters so far isn't part of matched portion, something like lookbehind"
match double quote character(?=[^|"]*(\||$))
to ensure that rest of the field doesn't have double quote character
(?=[^|"]*\|)
as all the fields in given sample ends with |
Upvotes: 2
Reputation: 6798
Something like this should do it
use strict;
use warnings;
while( <DATA> ) {
chomp;
my @fields = split /\|/;
map { my $c = () = $_ =~ /"/g; s/"/ / if $c == 1 } @fields;
print join('|', @fields) . "|\n";
}
__DATA__
name |age|role |experience|
John |25 |Developer |2.56 |
Scot"x|30 |Tester |5.2 |
Jim |28 |DBA |3.0 |
Mike" |35 |"Consultant"|10.0 |
Daniel|26 |Developer |3.2 |
Paul" |29 |Tester " |3.6 |
Peter |30 |Developer |6.5 |
Upvotes: 1
Reputation: 139531
Determining whether (ASCII) double quotes are balanced is a matter of counting whether they are even in number. Balancing “ and ” requires recursive subpatterns.
A design decision remains. In the case of unbalanced double quotes, which do you drop? If the first, then think of it as an odd quote followed by some even number of quotes. Translated from English to regex:
my $unbalanced_quote_first = qr/
^ ([^"]*)
"
( (?: [^"]* " [^"]* " [^"]* )* )
([^"]*)
$
/x;
The last is an even number of double quotes followed by the odd quote.
my $unbalanced_quote_last = qr/
^ ([^"]*)
( (?: [^"]* " [^"]* " [^"]* )* )
"
([^"]*)
$
/x;
Those who want to remove a quote somewhere in the middle are truly sick puppies, consideration of which ends here.
In action, see the demo program below.
#! /usr/bin/env perl
use strict;
use warnings;
my $unbalanced_quote_first = qr/
^ ([^"]*)
"
( (?: [^"]* " [^"]* " [^"]* )* )
([^"]*)
$
/x;
*ARGV = *DATA; # for demo only
while (<>) {
chomp;
my @fields = split /\|/;
s/$unbalanced_quote_first/$1 $2$3/ for @fields;
print join("|", @fields), "|\n";
}
__DATA__
name |age|role |experience|
John |25 |Developer |2.56 |
Scot"x|30 |Tester |5.2 |
Jim |28 |DBA |3.0 |
Mike" |35 |"Consultant"|10.0 |
Daniel|26 |Developer |3.2 |
Paul" |29 |Tester " |3.6 |
Peter |30 |Developer |6.5 |
"Mike |42 |""Quoter"" |17.0 |
"Bill"|25 |"S"kilz" |2".5 |
Betty |26 |"Hi" "Bye" |3.2 |
Its output:
name |age|role |experience| John |25 |Developer |2.56 | Scot x|30 |Tester |5.2 | Jim |28 |DBA |3.0 | Mike |35 |"Consultant"|10.0 | Daniel|26 |Developer |3.2 | Paul |29 |Tester |3.6 | Peter |30 |Developer |6.5 | Mike |42 |""Quoter"" |17.0 | "Bill"|25 | S"kilz" |2 .5 | Betty |26 |"Hi" "Bye" |3.2 |
To replace the last double quote with a space, the substitution becomes
s/$unbalanced_quote_last/$1$2 $3/;
Note the movement of the space on the right-hand side. Its output is
name |age|role |experience| John |25 |Developer |2.56 | Scot x|30 |Tester |5.2 | Jim |28 |DBA |3.0 | Mike |35 |"Consultant"|10.0 | Daniel|26 |Developer |3.2 | Paul |29 |Tester |3.6 | Peter |30 |Developer |6.5 | Mike |42 |""Quoter"" |17.0 | "Bill"|25 |"S"kilz |2 .5 | Betty |26 |"Hi" "Bye" |3.2 |
The only difference between the two is on the result for Bill.
Upvotes: 2
Reputation: 133546
Could you please try following. considering that you need to edit only those fields which are NOT having starting "
and ending "
in their value(tested and written with provided samples only).
awk '
BEGIN{
FS=OFS="|"
}
{
for(i=1;i<=NF;i++){
if($i!~/^".*"$/){
gsub(/"/," ",$i)
}
}
}
1
' Input_file
Upvotes: 3