Reputation: 1127
I am comparing two large comma-delimited CSV files File1.csv
and File2.csv
using the
Text::Diff
Perl module.
The Perl program is called from a .bat
file and I put the result in a third file Diff.csv
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Diff;
my $diffs = diff $ARGV[0] => $ARGV[1];
$diffs =~ s/^(?:[^\n]*+\n){2}//;
$diffs =~ s/^(?:[\@ ][^\n]*+)?+\n//mg;
print $diffs;
This is how I call the Perl script:
perl "C:\diffBetweenTwoFiles.pl" "C:\File1.csv" "C:\File2.csv" > "C:\Diff.csv"
One of the columns in the CSV file is Name
.
Currently the result lists all rows whose values in any columns change, but I want only to list new Name
rows.
For example:
"Name","DOB","Address"
"One","1/1/01","5 Stock Rd"
"Two","1/2/02","1 Research Rd"
"Name","DOB","Address"
"One","1/1/01","5 Stock Rd"
"Two","1/2/02","111 Research Rd"
"Three","1/3/03","3 Bold Rd"
Currently, the result list these (it includes "Two" because its Address changed):
"Name","DOB","Address"
"Two","1/2/02","111 Research Rd"
"Three","1/3/03","3 Bold Rd"
But, I only want the result to list the new "Name" like this:
"Name","DOB","Address"
"Three","1/3/03","3 Bold Rd"
How can I do that in Perl or Powershell script?
Upvotes: 2
Views: 477
Reputation: 66891
Use Text::CSV in Perl
use warnings;
use strict;
use feature 'say';
use Text::CSV;
my ($file_old, $file_new, $file_diff) =
map { $_ . '.csv' } qw(File1 File2 Diff);
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag();
my ($old, $header) = get_lines($csv, $file_old, 1);
my $new = get_lines($csv, $file_new);
my @lines_with_new_names = @{ new_names($old, $new) };
open my $fh, '>', $file_diff or die "Can't open $file_diff: $!";
$csv->say($fh, $header);
$csv->say($fh, $_) for @lines_with_new_names; # or print with eol set
sub new_names {
my ($old, $new) = @_;
my %old = map { $_->[0] => 1 } @$old;
return [ map { (!exists $old{$_->[0]}) ? $_ : () } @$new ];
}
sub get_lines {
my ($csv, $file, $return_header) = @_;
open my $fh, '<', $file or die "Can't open $file $!";
my $header = $csv->getline($fh); # remove the header line
return ($return_header)
? ( $csv->getline_all($fh), $header )
: $csv->getline_all($fh);
}
This prints the correct difference with the provided samples.
Variable names tagged with old
are related to the file with fewer lines, the other one being new
. The "Name" column is taken to be the first one.
Comments
The getline_all
method returns an arrayref for all lines, where each is an arrayref with all fields. This is done from a sub, with an option to return the header line as well.
The optional return of another variable here makes a difference of whether a single scalar or a list is returned, so it can also be handled using wantarray builtin
return wantarray ? ( LIST ) : scalar;
which returns true if the sub is called in a list context. Thus the caller decides by invoking the sub in either the list or scalar context, my ($v1, $v2) = f(...)
or my $v = f(...)
, in which case a flag is not needed in the call. I opted for a more explicit way.
The difference in names' list is produced in new_names
sub. First a lookup hash is made with all names from the "old" arrayref. Then lines in "new" arrayref are filtered, taking those which don't have a name in the "old" (no such key in the hash), and returned in an arrayref []
.
Such use of a hash is a standard technique for finding differences between arrays.
The documented method say
used for printing doesn't work on my older version of the module with which this is tested. In that case use print
and set eol in the constructor.
Upvotes: 1
Reputation: 9372
Since you're working with large files that are stressing your memory limit, you can try:
(UPDATED based on comments) A simple example in PowerShell:
$output = New-Object System.Text.StringBuilder;
$file1 = @{};
$header = $null;
# $filePaths is two-element array with full path to CSV files
for ($i = 0; $i -lt $filePaths.Length; ++$i) {
$reader = New-Object System.IO.StreamReader($filePaths[$i]);
while (($line = $reader.ReadLine()) -ne $null) {
if ($line -match '\S') {
if ($header -eq $null) {
$header = $line;
$output.AppendLine($line) | Out-Null;
}
$name = ($line -split ',')[0];
switch ($i) {
0 { $file1.Add($name, $null); }
1 {
if (!$file1.ContainsKey($name)) {
$output.AppendLine($line) | Out-Null;
}
}
}
}
}
$reader.Dispose();
}
$output.ToString() | Out-File -FilePath $outPath;
Upvotes: 1