Reputation: 11
perl -i -pe 's/(,\h*"[^\n"]*)\n/$1 /g' /opt/data-integration/transfer/events/processing/Master_Events_List.csv
What is going on here? I tried a translator but its a bit vague. What are some examples that might return here?
Upvotes: 0
Views: 135
Reputation:
The best way to fix newlines in quoted fields that masquerade as End-Of-Record :
First, don't try and manipulate CSV (or XML or HTML) with modules. While CSV might seem tricky, it is extremely simple. Don't use Text::CSV. Instead, use a substitute regex with a callback.
Also, you can use the regex to just correctly parse a csv without replacing
newlines, but you probably want to use Perl to fix it for use in some other language.
Regex (with trim)
/((?:^|,|\r?\n))\s*(?:("[^"\\]*(?:\\[\S\s][^"\\]*)*"[^\S\r\n]*(?=$|,|\r?\n))|([^,\r\n]*(?=$|,|\r?\n)))/
Explained
( # (1 start), Delimiter (comma or newline)
(?: ^ | , | \r? \n )
) # (1 end)
\s* # Leading optional whitespaces ( this is for trim )
# ( if no trim is desired, remove this, add
# [^\S\r\n]* to end of group 1 )
(?:
( # (2 start), Quoted string field
" # Quoted string
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
[^\S\r\n]* # Trailing optional horizontal whitespaces
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (2 end)
| # OR
( # (3 start), Non quoted field
[^,\r\n]* # Not comma or newline
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (3 end)
)
(Note - this requires a script.)
Perl sample
use strict;
use warnings;
$/ = undef;
sub RmvNLs {
my ($delim, $quote, $non_quote) = @_;
if ( defined $non_quote ) {
return $delim . $non_quote;
}
$quote =~ s/\s*\r?\n/ /g;
return $delim . $quote;
}
my $csv = <DATA>;
$csv =~ s/
( # (1 start), Delimiter (comma or newline)
(?: ^ | , | \r? \n )
) # (1 end)
\s* # Leading optional whitespaces ( this is for trim )
# ( if no trim is desired, remove this, add [^\S\r\n]* to end of group 1 )
(?:
( # (2 start), Quoted string field
" # Quoted string
[^"\\]*
(?: \\ [\S\s] [^"\\]* )*
"
[^\S\r\n]* # Trailing optional horizontal whitespaces
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (2 end)
| # OR
( # (3 start), Non quoted field
[^,\r\n]* # Not comma or newline
(?= $ | , | \r? \n ) # Delimiter ahead (EOS, comma or newline)
) # (3 end)
)
/RmvNLs($1,$2,$3)/xeg;
print $csv;
__DATA__
497,50,2008-08-02T16:56:53Z,469,4,
"foo bar
foo
bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar
bar"
hello
world
"asdfas"
ID,NAME,TITLE,DESCRIPTION,,
PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING
HARD ON BEING A GOOD
SERVENT."
PRO1235, "KEITH SMITH",ENGINEER,"keith has been working
hard on being a good
servent."
PRO1235,"KENNY SMITH",,"keith has been working
hard on being a good
servent."
PRO1235,"RICK SMITH",,, #
Output
497,50,2008-08-02T16:56:53Z,469,4,"foo bar foo bar"
518,153,2008-08-02T17:42:28Z,469,2,"foo bar bar"
hello
world
"asdfas"
ID,NAME,TITLE,DESCRIPTION,,PRO1234,"JOHN SMITH",ENGINEER,"JOHN HAS BEEN WORKING HARD ON BEING A GOOD SERVENT."
PRO1235,"KEITH SMITH",ENGINEER,"keith has been working hard on being a good servent."
PRO1235,"KENNY SMITH",,"keith has been working hard on being a good servent."
PRO1235,"RICK SMITH",,,
Upvotes: 0
Reputation: 2945
Edited this to second Schwern (also upvoted): regular expressions seem to be a poor fit for manipulating CSV.
As for the regular expression in question, let's dissect it. Starting with the top level:
's/(,\h*"[^\n"]*)\n/$1 /g'
The s/part1/part2/g
expression means "substitute the first part with the second part everywhere".
Now let's examing the "first part":
(,\h*"[^\n"]*)\n
The parentheses are enclosing a group. There is only one group, so it becomes group number 1. We'll come back to that in the next step.
Then, check out https://perldoc.perl.org/perlrebackslash.html for explanation of the character classes. \h
is a horizontal whitespace and \n
is a logical newline character.
The expression inside the group is stating: "starts with a comma, then any number of horizontal whitespace characters, then anything but a newline and quote; finally, there must be a trailing newline". So it is basically a comma follwed by a csv field.
Lastly, the "second part" reads:
$1
This is just a reference to the group number 1 that was captured earlier followed by a space.
In overall, the whole expression replaces a trailing string field that is not terminated with a quote and removing it's newline terminator.
Upvotes: 0
Reputation: 164769
First, don't try and manipulate CSV (or XML or HTML) with regexes. While CSV might seem simple, it can be subtle. Instead use Text::CSV. The exception is if your CSV is malformed and you're fixing it.
Now, for what your regex is doing. First, let's translate it it from s//
to s{}{}
which is a bit easier on the eyes and use \x
so we can space things out a bit.
s{
# Capture to $1
(
# A comma.
,
# 0 or more `h` "horizontal whitespace": tabs and spaces
\h*
# A quote.
"
# 0 or more of anything which is not a quote or newline.
[^\n"]*
)
# A newline (not captured)
\n
}
# Put the captured bit in with a space after it.
# The `g` says to do it multiple times over the whole string.
{$1 }gx
It will change foo, "bar\n
into foo, "bar
. I'm guessing it's turning text fields in the CSV with newlines in them into ones with just spaces.
foo, "first
field", "second
field"
Will become
foo, "first field", "second field"
This is something better handled with Text::CSV. I suspect the purpose of the transform is to help out CSV parsers which cannot handle newlines. Text::CSV can with a little coercing.
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
use autodie;
use Text::CSV;
use IO::Scalar;
use Data::Dumper;
# Pretend our scalar is an IO object so we can use `getline`.
my $str = qq[foo, "bar", "this\nthat"\n];
my $io = IO::Scalar->new(\$str);
# Configure Text::CSV
my $csv = Text::CSV->new({
# Embedded newlines normally aren't allowed, this tells Text::CSV to
# treat the content as binary instead.
binary=> 1,
# Allow spaces between the cells.
allow_whitespace => 1
});
# Use Text::CSV->getline() to do the parsing.
while( my $row = $csv->getline($io) ) {
# Dump the contents of the row
say Dumper $row;
}
And it will correctly parse the row and its embedded newlines.
$VAR1 = [
'foo',
'bar',
'this
that'
];
Upvotes: 3