user1562471

Reputation: 33

Perl split a text file into chunks

I have a large text file made up of thousands of articles, and I am trying to split it into individual files, one per article, saved as article_1, article_2, etc. Each article begins with a line containing the word /DOCUMENTS/. I am totally new to Perl, and any insight would be great (even advice on good documentation websites). Thanks a lot. So far, what I have tried looks like this:

#!/usr/bin/perl
use warnings;
use strict;

my $id = 0;
my $source = "2010_FTOL_GRbis.txt";
my $destination = "file$id.txt";

open IN, $source or die "can t read $source: $!\n";

while (<IN>)
  {
    {  
      open OUT, ">$destination" or die "can t write $destination: $!\n";
      if (/DOCUMENTS/)
       {
         close OUT ;
         $id++;
       }
    }
  }
close IN;

Upvotes: 3

Views: 3537

Answers (2)

Axeman

Reputation: 29854

Let's say that /DOCUMENTS/ appears by itself on a line. Then you can make that line the record separator.

use English     qw<$RS>;
use File::Slurp qw<write_file>;
my $id     = 0;
my $source = "2010_FTOL_GRbis.txt";

{   local $RS = "\n/DOCUMENTS/\n";
    open my $in, '<', $source or die "can't read $source: $!\n";
    while ( <$in> ) { 
        chomp; # removes the trailing "\n/DOCUMENTS/\n" separator
        write_file( 'file' . ( ++$id ) . '.txt', $_ );
    }
    # $in is lexically scoped to this block, so the explicit close
    # below is not strictly necessary
    close $in;
}

NOTES:

  • use English gives the global variable $/ the readable name $RS. See perldoc perlvar.
  • By default the record separator is a newline. That is, the standard unit of file reading is a record, which by default is just a "line".
  • As you will find in the linked documentation, $RS only takes literal strings, not regular expressions. So, on the assumption that the division between articles is '/DOCUMENTS/' all by itself on a line, I specified newline + '/DOCUMENTS/' + newline. If it is instead part of a path that occurs somewhere on the line, then that particular value will not work as the record separator.
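
To see the record separator at work without touching any files, here is a small sketch that reads from an in-memory string. The sample text is made up for illustration, and $/ is used directly so that no module is needed:

```perl
use strict;
use warnings;

# Made-up sample: some leading text, then two articles, each preceded by
# a line holding only /DOCUMENTS/.
my $text = "intro\n/DOCUMENTS/\nfirst article\n/DOCUMENTS/\nsecond article\n";

my @records;
{
    local $/ = "\n/DOCUMENTS/\n";    # $/ is the variable English calls $RS
    open my $in, '<', \$text or die "can't read in-memory text: $!\n";
    while ( my $record = <$in> ) {
        chomp $record;               # chomp strips the current value of $/
        push @records, $record;
    }
    close $in;
}

# The last record is not followed by the separator, so chomp leaves its
# ordinary trailing newline in place.
print scalar(@records), " records\n";
```

Note that anything before the first separator (here "intro") comes back as a record of its own, so a real input may need its first record skipped.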

Upvotes: 4

gaussblurinc

Reputation: 3682

Have you read Programming Perl? It is the best book for beginners!

I am not sure I understand what you are trying to do. I assume you have a text file containing articles and want to write each article to a separate file.

use warnings;
use strict;
use autodie;    # open/close die on failure; qw(:all) would also need IPC::System::Simple

my $id          = 0;
my $source      = "2010_FTOL_GRbis.txt";
my $destination = "file$id.txt";

open my $IN, '<', $source;
#open first file
open my $OUT, '>', $destination;

while (<$IN>) {
    chomp;    # remove the trailing \n
    if ($_ eq '/DOCUMENTS/') {  # assumes the marker is alone on its line; adjust if not
        close $OUT;             # close the lexical handle, not the bareword OUT
        $id++;
        $destination = "file$id.txt";
        open $OUT, '>', $destination;    # reuse the outer $OUT; "my" here would shadow it
    } else {
        print {$OUT} $_, "\n";     # write the line to the current file$id.txt
    }
}
close $OUT;
close $IN;
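
If the marker is only part of a line rather than the whole line (the question says each article begins with a line containing /DOCUMENTS/), a regex match can replace the eq test. This is a minimal sketch, assuming the marker line itself should be dropped and any text before the first marker ignored. split_articles is a hypothetical helper name and the sample text is made up:

```perl
use strict;
use warnings;

# Hypothetical helper: split text into articles, starting a new article at
# every line that contains the marker /DOCUMENTS/ anywhere in it.
sub split_articles {
    my ($text) = @_;
    my @articles;
    open my $in, '<', \$text or die "can't read in-memory text: $!\n";
    while ( my $line = <$in> ) {
        if ( $line =~ m{/DOCUMENTS/} ) {
            push @articles, '';    # a marker line starts a new article
            next;                  # and is itself dropped (an assumption)
        }
        next unless @articles;     # ignore text before the first marker
        $articles[-1] .= $line;
    }
    close $in;
    return @articles;
}

my @articles = split_articles(
    "/DOCUMENTS/ 1 of 2\nfirst\narticle\nsome /DOCUMENTS/ 2 of 2\nsecond\n"
);
printf "%d articles\n", scalar @articles;
```

Each element of @articles can then be written to its own file with the same open and print calls as above.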

Upvotes: 2
