faujong
faujong

Reputation: 1127

Using Perl or Powershell, how to compare 2 CSV files and get only the new rows?

I am comparing two large comma-delimited CSV files File1.csv and File2.csv using the Text::Diff Perl module. The Perl program is called from a .bat file and I put the result in a third file Diff.csv

Perl

#!/usr/bin/env perl

use strict;
use warnings;

use Text::Diff;

my $diffs = diff $ARGV[0] => $ARGV[1];

$diffs =~ s/^(?:[^\n]*+\n){2}//;
$diffs =~ s/^(?:[\@ ][^\n]*+)?+\n//mg;

print $diffs;

This is how I call the Perl script:

perl "C:\diffBetweenTwoFiles.pl" "C:\File1.csv" "C:\File2.csv" > "C:\Diff.csv"

One of the columns in the CSV file is Name.

Currently the result lists all rows whose values in any columns change, but I want only to list new Name rows.

For example:

File1.csv

"Name","DOB","Address"
"One","1/1/01","5 Stock Rd"
"Two","1/2/02","1 Research Rd"

File2.csv

"Name","DOB","Address"
"One","1/1/01","5 Stock Rd"
"Two","1/2/02","111 Research Rd"
"Three","1/3/03","3 Bold Rd"

Currently, the result list these (it includes "Two" because its Address changed):

"Name","DOB","Address"
"Two","1/2/02","111 Research Rd"
"Three","1/3/03","3 Bold Rd"

But, I only want the result to list the new "Name" like this:

"Name","DOB","Address"
"Three","1/3/03","3 Bold Rd"

How can I do that in Perl or Powershell script?

Upvotes: 2

Views: 477

Answers (2)

zdim
zdim

Reputation: 66891

Use Text::CSV in Perl

use warnings;
use strict;
use feature 'say';
use Text::CSV;

my ($file_old, $file_new, $file_diff) = 
    map { $_ . '.csv' } qw(File1 File2 Diff);

my $csv = Text::CSV->new ( { binary => 1 } )
    or die "Cannot use CSV: ".Text::CSV->error_diag();

my ($old, $header) = get_lines($csv, $file_old, 1);
my $new            = get_lines($csv, $file_new);

my @lines_with_new_names = @{ new_names($old, $new) };

open my $fh, '>', $file_diff  or die "Can't open $file_diff: $!";
$csv->say($fh, $header);
$csv->say($fh, $_) for @lines_with_new_names;  # or print with eol set

sub new_names {
    my ($old, $new) = @_;
    my %old = map { $_->[0] => 1 } @$old;
    return [ map { (!exists $old{$_->[0]}) ? $_ : () } @$new ];
}

sub get_lines {
    my ($csv, $file, $return_header) = @_;
    open my $fh, '<', $file or die "Can't open $file $!";
    my $header = $csv->getline($fh);  # remove the header line
    return ($return_header) 
        ? ( $csv->getline_all($fh), $header )
        :   $csv->getline_all($fh);
}

This prints the correct difference with the provided samples.

Variable names tagged with old are related to the file with fewer lines, the other one being new. The "Name" column is taken to be the first one.

Comments

  • The getline_all method returns an arrayref for all lines, where each is an arrayref with all fields. This is done from a sub, with an option to return the header line as well.

  • The optional return of another variable here makes a difference of whether a single scalar or a list is returned, so it can also be handled using wantarray builtin

    return wantarray ? ( LIST ) : scalar;
    

    which returns true if the sub is called in a list context. Thus the caller decides by invoking the sub in either the list or scalar context, my ($v1, $v2) = f(...) or my $v = f(...), in which case a flag is not needed in the call. I opted for a more explicit way.

  • The difference in names' list is produced in new_names sub. First a lookup hash is made with all names from the "old" arrayref. Then lines in "new" arrayref are filtered, taking those which don't have a name in the "old" (no such key in the hash), and returned in an arrayref [].

    Such use of a hash is a standard technique for finding differences between arrays.

The documented method say used for printing doesn't work on my older version of the module with which this is tested. In that case use print and set eol in the constructor.

Upvotes: 1

goTo-devNull
goTo-devNull

Reputation: 9372

Since you're working with large files that are stressing your memory limit, you can try:

  1. Read the first CSV file one line at a time, and use a hashtable to store the file's Name entries.
  2. Read the second CSV file one line at a time and compare it's Name entries against the first.

(UPDATED based on comments) A simple example in PowerShell:

$output = New-Object System.Text.StringBuilder;
$file1 = @{};
$header = $null;

# $filePaths is two-element array with full path to CSV files
for ($i = 0; $i -lt $filePaths.Length; ++$i) {
    $reader = New-Object System.IO.StreamReader($filePaths[$i]);
    while (($line = $reader.ReadLine()) -ne $null) {
        if ($line -match '\S') {
            if ($header -eq $null) { 
                $header = $line;
                $output.AppendLine($line) | Out-Null; 
            }
            $name = ($line -split ',')[0];
            switch ($i) {
                0 { $file1.Add($name, $null); }
                1 { 
                    if (!$file1.ContainsKey($name)) { 
                        $output.AppendLine($line) | Out-Null; 
                    } 
                }
            }
        }
    }
    $reader.Dispose();
}
$output.ToString() | Out-File -FilePath $outPath;

Upvotes: 1

Related Questions