Reputation: 683
I've got a 2GB text file and a 500MB text file. The 2GB file is in a slightly daft format, for example:
CD 15
IG ABH
NU 1223
**
CD 17
IG RFT
NU 3254
**
Where ** is the marker between records.
I need to extract all the values of NU where CD is a certain value, then go through the 500MB text file, find every record in it whose NU matches one of those values, and write those records to a new file.
I know PHP, and this would be trivial in PHP apart from the size of the files. Even reading a line at a time with fgets doesn't really work: it takes forever and then crashes my computer on localhost (under XAMPP, apache.exe grows until it uses all system memory). Doing it in PHP would also be a pain, because non-technical people need to run it: they'd have to download the 2GB and 500MB files from the FTP server when they become available each week, upload them to my FTP server (which is flaky with files that large), and then run a script on my server that takes ages.
I know a bit of VBScript, no Perl, no .NET, no C# etc. How can I write a Windows-based programme that will run locally, load the files a line at a time, and not crash due to the file size?
Upvotes: 2
Views: 1191
Reputation: 386551
The following will create a hash (a type of associative array) with one (small) element for each NU to find in the second file. How big that hash will be depends on how many matching records you have in the first file.
If that still takes too much memory, break the first file down into smaller parts, run the program more than once, and concatenate the results.
use strict;
use warnings;

my $qfn_idx = '...';
my $qfn_in = '...';
my $qfn_out = '...';
my $cd_to_match = ...;

my %nus;

# Pass 1: remember the NU of every record whose CD matches.
{
    open(my $fh_idx, '<', $qfn_idx)
        or die("Can't open \"$qfn_idx\": $!\n");

    # Read one whole record (up to and including the ** line) at a time.
    local $/ = "\n**\n";
    while (<$fh_idx>) {
        next if !( my ($cd) = /^CD ([0-9]+)/m );
        next if $cd != $cd_to_match;
        next if !( my ($nu) = /^NU ([0-9]+)/m );
        ++$nus{$nu};
    }
}

# Pass 2: copy every record whose NU was remembered in pass 1.
{
    open(my $fh_in, '<', $qfn_in)
        or die("Can't open \"$qfn_in\": $!\n");
    open(my $fh_out, '>', $qfn_out)
        or die("Can't create \"$qfn_out\": $!\n");

    local $/ = "\n**\n";
    while (<$fh_in>) {
        next if !( my ($nu) = /^NU ([0-9]+)/m );
        next if !$nus{$nu};
        print($fh_out $_);
    }
}
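For anyone who would rather not use Perl, the same two-pass idea can be sketched in Python. Treat this as a rough sketch, not a drop-in replacement: the function names (read_records, collect_nus, filter_records) are made up for illustration, and it assumes that each field sits on its own line and that a line containing only ** ends a record. Only the set of matching NU values is held in memory; both files are streamed.

```python
import re
from typing import Iterable, Iterator, Set, TextIO

CD_RE = re.compile(r"^CD ([0-9]+)", re.MULTILINE)
NU_RE = re.compile(r"^NU ([0-9]+)", re.MULTILINE)


def read_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one record at a time; a line that is exactly ** ends a record."""
    buf = []
    for line in lines:
        if line.rstrip("\r\n") == "**":
            yield "".join(buf)
            buf = []
        else:
            buf.append(line)
    if buf:  # last record had no closing ** line
        yield "".join(buf)


def collect_nus(lines: Iterable[str], cd_to_match: int) -> Set[str]:
    """Pass 1: gather the NU of every record whose CD equals cd_to_match."""
    nus = set()
    for rec in read_records(lines):
        cd = CD_RE.search(rec)
        nu = NU_RE.search(rec)
        if cd and nu and int(cd.group(1)) == cd_to_match:
            nus.add(nu.group(1))
    return nus


def filter_records(lines: Iterable[str], nus: Set[str], out: TextIO) -> int:
    """Pass 2: copy records whose NU is in nus; return how many were written."""
    written = 0
    for rec in read_records(lines):
        nu = NU_RE.search(rec)
        if nu and nu.group(1) in nus:
            out.write(rec if rec.endswith("\n") else rec + "\n")
            out.write("**\n")
            written += 1
    return written


# Hypothetical usage (file names are placeholders):
# with open("big.txt") as idx:
#     nus = collect_nus(idx, 15)
# with open("small.txt") as src, open("matches.txt", "w") as out:
#     filter_records(src, nus, out)
```

Because open() returns an iterator over lines, neither file is ever loaded whole.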
Upvotes: 2
Reputation: 67910
Basically the same idea as ikegami's, but with a subroutine and some handy argument handling.
The basic idea is to read in a complete record by setting the input record separator $/ to your record separator, "\n**\n", turn that record into a hash, save the NU values and use them for later lookup. Note the use of eof to switch from collecting NU values to printing matches once the first file has been read.
I did hardcode the input for CD, but changing it to my $CD = shift; will allow you to do:
script.pl 15 CD.txt NU.txt > outputfile
I am not overly fond of using the input record separator, as it is rather inflexible and sensitive to data corruption, such as missing newlines at eof. But as long as data is consistent, there should be no problem.
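To make that caveat concrete, here is a small Python illustration on in-memory strings (not something to run on the 2GB file). The helper names split_strict and split_tolerant are invented for this example. The first mimics cutting on the exact separator "\n**\n"; the second cuts on any line that is just ** and so copes with a missing final newline.

```python
from typing import List


def split_strict(text: str, sep: str = "\n**\n") -> List[str]:
    """Cut only on the exact separator, like a fixed input record separator."""
    return [part for part in text.split(sep) if part.strip()]


def split_tolerant(text: str) -> List[str]:
    """Cut on any line that is exactly ** once surrounding whitespace is stripped."""
    records, buf = [], []
    for line in text.splitlines():
        if line.strip() == "**":
            if buf:
                records.append("\n".join(buf))
            buf = []
        else:
            buf.append(line)
    if buf:  # data after the last marker, if any
        records.append("\n".join(buf))
    return records


good = "CD 15\nNU 1223\n**\nCD 17\nNU 3254\n**\n"
bad = "CD 15\nNU 1223\n**\nCD 17\nNU 3254\n**"  # final newline missing

# With the strict split, the last record of `bad` keeps a stray ** line,
# because "\n**" without the trailing newline is not the separator.
print(split_strict(bad)[-1].endswith("**"))
# The tolerant split gives the same records for both inputs.
print(split_tolerant(bad) == split_tolerant(good))
```

In the strict case the stray ** line ends up inside the last record, which is the kind of corruption a record-to-hash conversion then has to deal with.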
Usage:
script.pl CD.txt NU.txt > outputfile
Where CD.txt is the file from which you extract the NU values to look up in NU.txt.
Code:
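use strict;
use warnings;

my $CD = 15;
my %NU;
my $read = 1;    # true while reading the first file (collecting NU values)

local $/ = "\n**\n";
while (<>) {
    next unless /\S/; # no blank lines
    my %check = record($_);
    if ($read) {
        if ($check{'CD'} == $CD) {
            $NU{$check{'NU'}}++;
        }
    } else {
        if ($NU{$check{'NU'}}) {
            print;
        }
    }
    # eof is true at the end of each input file, so $read stays true
    # until the first file is exhausted and is false from then on.
    $read &&= !eof;
}

sub record {
    my $str = shift;
    chomp $str; # remove record separator **
    # Each line is "KEY value": split on the first space into key/value pairs.
    return map(split(/ /, $_, 2), split(/\n/, $str));
}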
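To show what the record-to-hash step produces, here is a rough Python equivalent of the record() subroutine above. The name parse_record is made up for this example, and it assumes every field line has the form KEY, one space, value.

```python
from typing import Dict


def parse_record(record: str) -> Dict[str, str]:
    """Turn one record into a dict keyed by field name."""
    fields = {}
    for line in record.splitlines():
        line = line.strip()
        if not line or line == "**":
            continue  # skip blank lines and the record marker
        key, _, value = line.partition(" ")  # split on the first space only
        fields[key] = value
    return fields


print(parse_record("CD 15\nIG ABH\nNU 1223\n**\n"))
# prints {'CD': '15', 'IG': 'ABH', 'NU': '1223'}
```

With the record in that shape, the CD check and the NU lookup are plain dictionary accesses.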
Upvotes: 0
Reputation: 26289
The following declares a VBScript Sub that reads the source file one line at a time and writes a record's NU value to the destination file only when the record's CD equals the cdfilter string:
Option Explicit

Const ForReading = 1
Const ForWriting = 2

Sub Extract(srcpath, dstpath, cdfilter)
    Dim fso, src, dst, txt, cd, nu
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set src = fso.OpenTextFile(srcpath, ForReading)
    Set dst = fso.OpenTextFile(dstpath, ForWriting, True)
    While (not src.AtEndOfStream)
        txt = ""
        ' Read lines until the ** marker that ends the current record
        While (not src.AtEndOfStream) and (txt <> "**")
            txt = src.ReadLine
            If Left(txt, 3) = "CD " Then
                cd = Mid(txt, 4)
            End If
            If Left(txt, 3) = "NU " Then
                nu = Mid(txt, 4)
            End If
            If txt = "**" Then
                If cd = cdfilter Then
                    dst.WriteLine nu
                End If
                ' Reset for every record so stale values never carry over
                cd = ""
                nu = ""
            End If
        Wend
    Wend
    src.Close
    dst.Close
End Sub

Extract "input.txt", "output.txt", "17"
Upvotes: 0