Apemantus

Reputation: 683

Parsing a very large text file on Windows

I've got a 2GB text file and a 500MB text file. The 2GB file is in a slightly daft format, e.g.:

CD 15
IG ABH
NU 1223
**
CD 17
IG RFT
NU 3254
**

Where ** is the marker between records.

I need to extract all the values of NU where CD is a certain value; I then need to go through the 500MB text file and then match all the records in there with the NU values from the 2GB file and then write those to a new file.

I know PHP. This is trivial in PHP, apart from the size of the file. Even using fgets to read a line at a time doesn't really work: it takes forever and then crashes my computer on localhost (under XAMPP, apache.exe grows until it uses up all system memory). Doing it in PHP would also be a pain, as it's for non-technical people to run: they'd need to download the 2GB and 500MB files from the FTP server when they become available each week, upload them to my FTP server (which is flaky with such large files), then run a script on my server that takes ages, etc.

I know a bit of VBScript, no Perl, no .NET, no C# etc. How can I write a Windows-based programme that will run locally, load the files a line at a time, and not crash due to the file size?

Upvotes: 2

Views: 1191

Answers (3)

ikegami

Reputation: 386551

The following will create a hash (a type of associative array) with one (small) element for each NU to find in the second file. How big that hash will be depends on how many matching records you have in the first file.

If that still takes too much memory, break the first file down into smaller parts, run the program more than once, and concatenate the results.

use strict;
use warnings;

my $qfn_idx = '...';
my $qfn_in  = '...';
my $qfn_out = '...';

my $cd_to_match = ...;

my %nus;
{
   open(my $fh_idx, '<', $qfn_idx)
      or die("Can't open \"$qfn_idx\": $!\n");

   local $/ = "\n**\n";
   while (<$fh_idx>) {
      next if !( my ($cd) = /^CD ([0-9]+)/m );
      next if $cd != $cd_to_match;
      next if !( my ($nu) = /^NU ([0-9]+)/m );
      ++$nus{$nu};
   }
}

{
   open(my $fh_in, '<', $qfn_in)
      or die("Can't open \"$qfn_in\": $!\n");
   open(my $fh_out, '>', $qfn_out)
      or die("Can't create \"$qfn_out\": $!\n");

   local $/ = "\n**\n";
   while (<$fh_in>) {
      next if !( my ($nu) = /^NU ([0-9]+)/m );
      next if !$nus{$nu};
      print($fh_out $_);
   }
}

Upvotes: 2

TLP

Reputation: 67910

Basically the same idea as ikegami's, but with a subroutine and some handy argument handling.

The basic idea is to read in a complete record by setting the input record separator $/ to your record separator, "\n**\n", turn that record into a hash, save the NU values and use them for later lookup. Note how eof is used to switch from collecting NU values to looking them up once the end of the first file is reached.

I did hardcode the input for CD, but changing it to my $CD = shift; will allow you to do:

script.pl 15 CD.txt NU.txt > outputfile

I am not overly fond of using the input record separator, as it is rather inflexible and sensitive to data corruption, such as missing newlines at eof. But as long as data is consistent, there should be no problem.

Usage:

script.pl CD.txt NU.txt > outputfile

Here CD.txt is the file from which the NU values are extracted, and NU.txt is the file in which they are then looked up.

Code:

use strict;
use warnings;

my $CD = 15;
my %NU;
my $read = 1;
local $/ = "\n**\n";
while (<>) {
    next unless /\S/; # no blank lines
    my %check = record($_);
    if ($read) {
        if ($check{'CD'} == $CD) {
            $NU{$check{'NU'}}++;
        }
    } else {
        if ($NU{$check{'NU'}}) {
            print;
        }
    }
    $read &&= !eof;  # keep collecting until the end of the first file
}

sub record {
    my $str = shift;
    chomp $str;  # remove record separator **
    return map(split(/ /, $_, 2), split(/\n/, $str));
}
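
The concern above about the input record separator (a missing newline or a missing final ** at the end of the file) can be sidestepped by reading line by line and treating ** as an end-of-record marker. The following is a minimal sketch of that idea, written in Python rather than Perl purely for illustration; the function name read_records and the sample data are assumptions, not part of the script above.

```python
import io


def read_records(lines, separator="**"):
    """Yield one dict per record from an iterable of lines.

    Each line is split on its first space into a key and a value.
    A record ends at a line equal to `separator`, or at the end of the
    input if the final separator is missing.
    """
    record = {}
    for line in lines:
        line = line.rstrip("\r\n")      # tolerate both LF and CRLF endings
        if line.strip() == separator:
            if record:
                yield record
            record = {}
        elif line.strip():              # skip blank lines
            key, _, value = line.partition(" ")
            record[key] = value
    if record:                          # last record had no closing separator
        yield record


# Sample input whose last record lacks both the trailing newline and "**"
sample = io.StringIO("CD 15\nIG ABH\nNU 1223\n**\nCD 17\nIG RFT\nNU 3254")
records = list(read_records(sample))
```

Because the function is a generator over an iterable of lines, passing it an open file handle keeps only one record in memory at a time, which is the same property the $/ approach relies on.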

Upvotes: 0

Stephen Quan

Reputation: 26289

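The following declares a VBScript Sub that reads the source file one line at a time and, for each record whose CD matches the cdfilter string, writes that record's NU value to the destination file. This covers the first step (extracting the NU values); matching them against the second file is not included: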

Option Explicit

Const ForReading = 1
Const ForWriting = 2

Sub Extract(srcpath, dstpath, cdfilter)
  Dim fso, src, dst, txt, cd, nu
  Set fso = CreateObject("Scripting.FileSystemObject")
  Set src = fso.OpenTextFile(srcpath, ForReading)
  Set dst = fso.OpenTextFile(dstpath, ForWriting, True)
  While (not src.AtEndOfStream)
    txt = ""
    While (not src.AtEndOfStream) and (txt <> "**")
      txt = src.ReadLine
      If Left(txt, 3) = "CD " Then
        cd = mid(txt, 4)
      End If
      If Left(txt, 3) = "NU " Then
        nu = mid(txt, 4)
      End If
      If txt = "**" Then
        If cd = cdfilter Then
          dst.WriteLine nu
        End If
        ' Reset for the next record, whether or not this one matched
        cd = ""
        nu = ""
      End If
    Wend
  Wend
  src.Close
  dst.Close
End Sub

Extract "input.txt", "output.txt", "17"
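
The second step from the question, copying the records of the 500MB file whose NU appears in the extracted list, could be sketched as follows. This is only an illustration, in Python rather than VBScript, and it rests on two assumptions: that the list file holds one NU value per line (as the Sub above writes it), and that the 500MB file uses the same ** separated record layout as the 2GB file. The function name filter_by_nu is hypothetical.

```python
import io


def filter_by_nu(nu_lines, record_lines, out, separator="**"):
    """Copy to `out` every record whose NU value appears in `nu_lines`."""
    # The wanted NU values are the only data held in memory for the whole run
    wanted = {line.strip() for line in nu_lines if line.strip()}
    buffered = []   # lines of the record currently being read
    nu = None
    for line in record_lines:
        buffered.append(line)
        text = line.rstrip("\r\n")
        if text.startswith("NU "):
            nu = text[3:].strip()
        elif text == separator:
            if nu in wanted:
                out.writelines(buffered)
            buffered, nu = [], None
    # Lines left in `buffered` belong to an unterminated trailing record
    # and are dropped in this sketch.


# Small in-memory stand-ins for the NU list and the 500MB file
nus = io.StringIO("3254\n")
data = io.StringIO("CD 15\nIG ABH\nNU 1223\n**\nCD 17\nIG RFT\nNU 3254\n**\n")
result = io.StringIO()
filter_by_nu(nus, data, result)
```

With real files, the three arguments would be handles returned by open(); iterating a file handle reads it lazily, so memory use is bounded by the set of NU values plus one record.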

Upvotes: 0
