stack0114106
stack0114106

Reputation: 8711

Fix unbalanced quotes within fields not across records

I have a pipe separated file like below, where the double quotes are not balanced within fields, I need to fix them

name  |age|role        |experience|
John  |25 |Developer   |2.56      |
Scot"x|30 |Tester      |5.2       |
Jim   |28 |DBA         |3.0       |
Mike" |35 |"Consultant"|10.0      |
Daniel|26 |Developer   |3.2       |
Paul" |29 |Tester  "   |3.6       |
Peter |30 |Developer   |6.5       |

Required output:

name  |age|role        |experience|
John  |25 |Developer   |2.56      |
Scot x|30 |Tester      |5.2       |
Jim   |28 |DBA         |3.0       |
Mike  |35 |"Consultant"|10.0      |
Daniel|26 |Developer   |3.2       |
Paul  |29 |Tester      |3.6       |
Peter |30 |Developer   |6.5       |

I tried below command, but it is not giving expected results..

perl -ne ' { while(/\x22/g) {$c++} ; if($c%2!=0) { s/\x22//g ; print} else { print } ; $c=0 } ' file

Upvotes: 1

Views: 1215

Answers (4)

Sundeep
Sundeep

Reputation: 23677

Assuming you only have one double quote in problematic fields and no | character inside fields:

$ perl -lpe 's/(^|\|)[^"|]*\K"(?=[^|"]*(\||$))/ /g' ip.txt
name  |age|role        |experience|
John  |25 |Developer   |2.56      |
Scot x|30 |Tester      |5.2       |
Jim   |28 |DBA         |3.0       |
Mike  |35 |"Consultant"|10.0      |
Daniel|26 |Developer   |3.2       |
Paul  |29 |Tester      |3.6       |
Peter |30 |Developer   |6.5       |
  • (^|\|) beginning of line or field separator |
  • [^"|]* zero or more of non-double quote and non | characters
  • \K so that the characters so far isn't part of matched portion, something like lookbehind
  • " match double quote character
  • (?=[^|"]*(\||$)) to ensure that rest of the field doesn't have double quote character
    • can also use (?=[^|"]*\|) as all the fields in given sample ends with |

Upvotes: 2

Polar Bear
Polar Bear

Reputation: 6798

Something like this should do it

use strict;
use warnings;

while( <DATA> ) {
    chomp;
    my @fields = split /\|/;

    map { my $c = () = $_ =~ /"/g; s/"/ / if $c == 1 } @fields;

    print join('|', @fields) . "|\n";   
}

__DATA__
name  |age|role        |experience|
John  |25 |Developer   |2.56      |
Scot"x|30 |Tester      |5.2       |
Jim   |28 |DBA         |3.0       |
Mike" |35 |"Consultant"|10.0      |
Daniel|26 |Developer   |3.2       |
Paul" |29 |Tester  "   |3.6       |
Peter |30 |Developer   |6.5       |

Upvotes: 1

Greg Bacon
Greg Bacon

Reputation: 139531

Determining whether (ASCII) double quotes are balanced is a matter of counting whether they are even in number. Balancing “ and ” requires recursive subpatterns.

A design decision remains. In the case of unbalanced double quotes, which do you drop? If the first, then think of it as an odd quote followed by some even number of quotes. Translated from English to regex:

my $unbalanced_quote_first = qr/
  ^ ([^"]*)
    "
    ( (?: [^"]* " [^"]* " [^"]* )* )
    ([^"]*)
  $
/x;

The last is an even number of double quotes followed by the odd quote.

my $unbalanced_quote_last = qr/
  ^ ([^"]*)
    ( (?: [^"]* " [^"]* " [^"]* )* )
    "
    ([^"]*)
  $
/x;

Those who want to remove a quote somewhere in the middle are truly sick puppies, consideration of which ends here.

In action, see the demo program below.

#! /usr/bin/env perl

use strict;
use warnings;

my $unbalanced_quote_first = qr/
  ^ ([^"]*)
    "
    ( (?: [^"]* " [^"]* " [^"]* )* )
    ([^"]*)
  $
/x;

*ARGV = *DATA;  # for demo only
while (<>) {
  chomp;
  my @fields = split /\|/;
  s/$unbalanced_quote_first/$1 $2$3/ for @fields;
  print join("|", @fields), "|\n";
}

__DATA__
name  |age|role        |experience|
John  |25 |Developer   |2.56      |
Scot"x|30 |Tester      |5.2       |
Jim   |28 |DBA         |3.0       |
Mike" |35 |"Consultant"|10.0      |
Daniel|26 |Developer   |3.2       |
Paul" |29 |Tester  "   |3.6       |
Peter |30 |Developer   |6.5       |
"Mike |42 |""Quoter""  |17.0      |
"Bill"|25 |"S"kilz"    |2".5      |
Betty |26 |"Hi" "Bye"  |3.2       |

Its output:

name  |age|role        |experience|
John  |25 |Developer   |2.56      |
Scot x|30 |Tester      |5.2       |
Jim   |28 |DBA         |3.0       |
Mike  |35 |"Consultant"|10.0      |
Daniel|26 |Developer   |3.2       |
Paul  |29 |Tester      |3.6       |
Peter |30 |Developer   |6.5       |
 Mike |42 |""Quoter""  |17.0      |
"Bill"|25 | S"kilz"    |2 .5      |
Betty |26 |"Hi" "Bye"  |3.2       |

To replace the last double quote with a space, the substitution becomes

s/$unbalanced_quote_last/$1$2 $3/;

Note the movement of the space on the right-hand side. Its output is

name  |age|role        |experience|
John  |25 |Developer   |2.56      |
Scot x|30 |Tester      |5.2       |
Jim   |28 |DBA         |3.0       |
Mike  |35 |"Consultant"|10.0      |
Daniel|26 |Developer   |3.2       |
Paul  |29 |Tester      |3.6       |
Peter |30 |Developer   |6.5       |
 Mike |42 |""Quoter""  |17.0      |
"Bill"|25 |"S"kilz     |2 .5      |
Betty |26 |"Hi" "Bye"  |3.2       |

The only difference between the two is on the result for Bill.

Upvotes: 2

RavinderSingh13
RavinderSingh13

Reputation: 133546

Could you please try following. considering that you need to edit only those fields which are NOT having starting " and ending " in their value(tested and written with provided samples only).

awk '
BEGIN{
  FS=OFS="|"
}
{
  for(i=1;i<=NF;i++){
    if($i!~/^".*"$/){
      gsub(/"/," ",$i)
    }
  }
}
1
' Input_file

Upvotes: 3

Related Questions