Reputation: 81
I have a tab-separated text file. These files can be quite big, up to 1 GB. The number of columns varies depending on the number of samples in the file; each sample has eight columns. For example, sample A has MIN_A, AVG_A, MAX_A, AR1_A, AR2_A, AR3_A, AR4_A, AR5_A, and ID1 and ID2 are common to all the samples. What I want to achieve is to split the whole file into separate files, one per sample.
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B,AVG_B,MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487
This is how my example file looks; I want to have the output as:
File A:
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853
File B:
ID1,ID2,MIN_B,AVG_B,MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948
File C:
ID1,ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487
Is there any easier way of doing this than going through an array?
The logic I have worked out is: counting the number of header columns, subtracting 2, and dividing by 8 gives me the number of samples in the file. Then I would go through each element of a row array and parse the fields out. That seems a tedious way of doing it; I would be happy to know of any simpler way of handling this.
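In other words, my counting step would be something like this rough sketch (assuming the header is the first, tab-separated line of the file):
my @headers = split /\t/, $header_line;       # $header_line holds the file's first line
my $number_of_samples = (@headers - 2) / 8;   # 2 shared ID columns, 8 columns per sample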
Thanks Sipra
Upvotes: 3
Views: 3119
Reputation: 107030
You said tab separated, but your example shows it being comma separated. I take it that's a limitation in putting your sample data in Markdown?
I guess you're a bit concerned about memory, so you want to open multiple output files and write to them as you parse your big file.
I would say to try Text::CSV::Simple. However, I believe it reads the entire file into memory, which might be a problem for a file this size.
It's pretty easy to read a line and put that line into a list. The issue is mapping the fields in that list to the names of the fields themselves.
If you read in a file with a while loop, you're not reading the whole file into memory at once. If you read each line, parse it, then write it out to the various output files, you're not taking up a lot of memory. There's an output buffer, but I believe it's flushed after a \n is written to the file.
The trick is to open the input file, then read in the first line. You want to create some sort of field mapping structure, so you can figure out which fields to write to each of the output files.
I would keep a list of all the files you need to write to; that way, you can loop over the list and handle each file in turn. Each item in the list should contain the information you need for writing to that file. First, you need a filehandle, so you know which file you're writing to. Second, you need the list of field numbers that have to be written to that particular output file.
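For example, with the three-sample header above, the structure might look something like this (the names are mine, and $fh_a, $fh_b, $fh_c are assumed to be filehandles already opened for writing):
# field numbers are 0-based indexes into the split input line
my @outputFileList = (
    { FILE_HANDLE => $fh_a, FIELD_LIST => [0, 1, 2 .. 9]   },   # File A
    { FILE_HANDLE => $fh_b, FIELD_LIST => [0, 1, 10 .. 17] },   # File B
    { FILE_HANDLE => $fh_c, FIELD_LIST => [0, 1, 18 .. 25] },   # File C
);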
I see some sort of processing loop like this:
while (my $line = <$input_fh>) {    # line from the input file
    chomp $line;
    my @input_line_array = split /\t/, $line;
    foreach my $output_file (@outputFileList) {    # list of output files
        my $fileHandle = $output_file->{FILE_HANDLE};
        my @fieldsToWrite;
        foreach my $fieldNumber (@{ $output_file->{FIELD_LIST} }) {
            push @fieldsToWrite, $input_line_array[$fieldNumber];
        }
        say {$fileHandle} join "\t", @fieldsToWrite;    # requires "use feature 'say';"
    }
}
I'm reading one line of the input file into $line and dividing it up into fields, which I put in @input_line_array. Now that I have the line, I have to figure out which fields get written to each of the output files.
I have a list called @outputFileList that holds all the output files I want to write to. $outputFileList[$fileNumber]->{FILE_HANDLE} contains the file handle for output file $fileNumber, and $outputFileList[$fileNumber]->{FIELD_LIST} is the list of fields I want to write to output file $fileNumber, indexed to the fields in @input_line_array. So
$outputFileList[$fileNumber]->{FIELD_LIST} = [0, 1, 2, 4, 6, 8];
means that I want to write $input_line_array[0], $input_line_array[1], $input_line_array[2], $input_line_array[4], $input_line_array[6], and $input_line_array[8] to the file handle in $outputFileList[$fileNumber]->{FILE_HANDLE}, in that order, as a tab-separated list.
I hope this is making some sense.
The initial problem is reading in the first line of <$input_fh> and parsing it into the needed complex structure. However, now that you have an idea of how this structure needs to be stored, parsing that first line shouldn't be too much of an issue.
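As a rough sketch (assuming tab separation, two shared ID columns, eight columns per sample, and my own choice of output file names):
# build @outputFileList from the header line; assumes $input_fh is already open
chomp(my $header = <$input_fh>);
my @header_fields = split /\t/, $header;
my $sample_count  = (@header_fields - 2) / 8;   # two shared ID columns
my @outputFileList;
foreach my $sample (0 .. $sample_count - 1) {
    my $first = 2 + $sample * 8;                # first column of this sample
    open my $fh, '>', "sample$sample.txt" or die $!;
    push @outputFileList, {
        FILE_HANDLE => $fh,
        FIELD_LIST  => [ 0, 1, $first .. $first + 7 ],
    };
}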
Although I didn't use object-oriented code in this example (I'm pulling this stuff out of my a... I mean... brain as I write this post), I would definitely use an object-oriented approach for this. It will actually make things go much faster by reducing errors.
Upvotes: 0
Reputation: 1227
This is independent of the number of samples. I'm not confident about the output file names, though, because you might end up with more than 26 samples; just change how the output file name is built if that's the case. :) One caveat: read_file from File::Slurp slurps the whole source file into memory, which may be heavy for a 1 GB input.
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
use Carp qw( croak );

# I'm lazy: this slurps the whole file into memory
my @source_file = read_file('source_file.csv');

# you mention yours is tab separated:
# just add {sep_char => "\t"} inside new()
my $csv = Text::CSV_XS->new()
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();

my $output_file;

# read each row
while ( my $raw_line = shift @source_file ) {
    $csv->parse($raw_line);
    my @fields = $csv->fields();

    # get the first 2 ids
    my @ids = splice @fields, 0, 2;

    my $group = 0;
    while (@fields) {
        # get the next 8 columns
        my @columns = splice @fields, 0, 8;

        # to change the separator of the output, replace ',' with "\t"
        push @{ $output_file->[$group] }, ( join ',', @ids, @columns ), $/;
        $group++;
    }
}

# for filename purposes: 65 is the ASCII code for 'A'
my $letter = 65;
foreach my $data (@$output_file) {
    my $output_filename = sprintf( 'SAMPLE_%c.csv', $letter );
    write_file( $output_filename, @$data );
    $letter++;
}

# if you reach more than 26 samples then you might want to use numbers instead
#my $sample_number = 1;
#foreach my $data (@$output_file) {
#    my $output_filename = sprintf( 'sample_%s.csv', $sample_number );
#    write_file( $output_filename, @$data );
#    $sample_number++;
#}
Upvotes: 2
Reputation: 2247
Here is a one-liner to print the first sample; you can write a shell script around it to write the data for the different samples into different files.
perl -F, -lane 'print "@F[0..1] @F[2..9]"' <INPUT_FILE_NAME>
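For instance, a small wrapper along these lines (the input name input.csv and the fixed count of three samples are my assumptions; this variant keeps the output comma separated):
# loop over the samples; the shell expands $i so perl sees a literal block offset
for i in 0 1 2; do
    perl -F, -lane "print join ',', @F[0,1], @F[2+8*$i .. 9+8*$i]" input.csv > "sample_$i.csv"
done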
Upvotes: 0
Reputation: 69224
#!/usr/bin/env perl

use strict;
use warnings;

# open three output filehandles
my %fh;
for (qw[A B C]) {
    open $fh{$_}, '>', "file$_" or die $!;
}

# open input
open my $in, '<', 'somefile' or die $!;

# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;

while (<$in>) {
    chomp;
    my @data = split /,/;
    # a hash element can't be used directly as a filehandle,
    # so it has to go in a block: print {$fh{A}} ...
    print {$fh{A}} join(',', @data[0 .. 9]),             "\n";
    print {$fh{B}} join(',', @data[0, 1, 10 .. 17]),     "\n";
    print {$fh{C}} join(',', @data[0, 1, 18 .. $#data]), "\n";
}
Update: I got bored and made it cleverer, so it automatically handles any number of 8-column records in a file. Unfortunately, I don't have time to explain it in detail.
#!/usr/bin/env perl

use strict;
use warnings;

# open input
open my $in, '<', 'somefile' or die $!;

# parse the header: two ID columns followed by 8 columns per sample
chomp(my $head = <$in>);
my @cols = split /,/, $head;
die 'Invalid number of columns - ' . @cols . "\n"
    if (@cols - 2) % 8;

# build one descriptor per sample: its column range and an output
# filehandle; write each per-file header line as we go
my @files;
my $name = 'A';
foreach (1 .. (@cols - 2) / 8) {
    my %desc;
    $desc{start_col} = (($_ - 1) * 8) + 2;
    $desc{end_col}   = $desc{start_col} + 7;
    open $desc{fh}, '>', 'file' . $name++ or die $!;
    print {$desc{fh}} join(',', @cols[0, 1],
                           @cols[$desc{start_col} .. $desc{end_col}]),
        "\n";
    push @files, \%desc;
}

# slice each data row into the per-sample files
while (<$in>) {
    chomp;
    my @data = split /,/;
    foreach my $f (@files) {
        print {$f->{fh}} join(',', @data[0, 1],
                              @data[$f->{start_col} .. $f->{end_col}]),
            "\n";
    }
}
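Run against the three-sample file from the question (saved as somefile), this should produce fileA, fileB and fileC with the per-sample columns shown there: each output file gets its own header line followed by the sliced data rows.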
Upvotes: 8