Remove duplicate lines on file by substring - preserve order (PERL)

Question

I'm trying to write a Perl script to process some 3+ GB text files that are structured like this:

1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

I want to perform two operations:

Count the number of delimiters per line and compare it to a fixed number (e.g. 5); lines that exceed that number should be written to a file.control.
Remove duplicates from the file by substr($line, 0, 7) (the first 7 digits) while preserving order. That output should go to a file.output.

I have coded this as a plain bash shell script, but it took too long to run. The same script calling Perl one-liners was quicker, but I'm interested in a way to do this purely in Perl.

The code I have so far is:

open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;

while ($line_in = <$file_hndl_in>)
{
    # Calculate n. of delimiters
    my $delim_cur_line = $line_in =~ y/"$delimiter"//;
    # print "$commas \n"

   if ( $delim_cur_line != $delim_amnt_per_line )
   {
      print {$file_hndl_ot_control} "$line_in";  
   }

   # Remove duplicates by substr(0,7) maintain order
   my $substr_in = substr $line_in, 0, 7;
   print if not $lines{$substr_in}++;

}

And I want the file.output file to look like this:

1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

and the file.control file to look like this:

(assuming delimiter control number is 6)

4352342xx23232xxx345545x45454x23232xxx

Could someone assist me? Thank you.

Edit: code I tried

my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;



open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");

while ( <> ) {

    my $count = ($_ =~ y/x//);
    print  "$count \n";
    # print $_;

    if ( $count != $delim_amnt_per_line )
    {
        print fh1 $_;
    }


    my ($prefix) = substr $_, 0, 7;
    next if $seen{$prefix}++;

    print fh2;
}

I don't know if I'm supposed to post new code here, but I tried the above, based on your example. What baffles me (I'm still very new to Perl) is that it doesn't write to either filehandle, whereas redirecting from the command line, just as you said, worked perfectly. The problem is that I need to write to 2 different files.

Borodin · Accepted Answer

It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3GB text file this may result in your perl process running out of memory, in which case a different approach is necessary. Please give this a try and see whether it stays within the memory you have available.

The tr/// operator (the same as y///) doesn't interpolate variables into its character list, so I've used eval to create a subroutine delimiters() that counts the number of occurrences of $delimiter in $_.
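If a string eval is not wanted, the count can also be taken with a global match evaluated in list context. The following is a minimal sketch of that alternative, not the method used in this answer; the helper name count_delimiters is hypothetical, and the sample line is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer): count how many times
# $delimiter occurs in $line without using tr/// or eval.
sub count_delimiters {
    my ($line, $delimiter) = @_;

    # \Q...\E quotes regex metacharacters, so a delimiter such as '|' or '.'
    # is matched literally. Assigning the match list to () and then to a
    # scalar yields the number of matches.
    my $count = () = $line =~ /\Q$delimiter\E/g;
    return $count;
}

# Sample line from the question, with 'x' as the delimiter.
print count_delimiters('1212123x534534534534xx4545454x232322xx', 'x'), "\n";   # prints 6
```

For a single fixed character tr/// is usually the faster choice, so this form mainly helps when the delimiter has to stay in a variable.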

It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as

$ perl filter.pl my_input.file > my_output.file

use strict;
use warnings 'all';

my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;

eval "sub delimiters { tr/$delimiter// }";

while ( <> ) {
    next if delimiters() == $delim_amnt_per_line;

    my ($prefix) = substr $_, 0, 7;
    next if $seen{$prefix}++;

    print;
}

output

1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
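The follow-up in the question needs two output files rather than one redirected stream. Note that the attempt there opens the lexical handles $fh1 and $fh2 but prints to the barewords fh1 and fh2, which name different, unopened handles, so nothing reaches the files. Below is a minimal sketch of a single-pass version that writes both files. It assumes the threshold of 6 and the "exceeds" rule from the question's expected output; the sub name filter_file, its option names, and the script name in the usage comment are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: one pass over the input, writing over-delimited
# lines to a control file and first-seen-prefix lines to an output file.
sub filter_file {
    my ($in_path, $control_path, $output_path, %opt) = @_;
    my $delimiter  = $opt{delimiter}  // 'x';   # assumed delimiter
    my $max_delims = $opt{max_delims} // 6;     # assumed threshold
    my $prefix_len = $opt{prefix_len} // 7;     # length of the dedup key

    open my $in,      '<', $in_path      or die "Cannot read $in_path: $!";
    open my $control, '>', $control_path or die "Cannot write $control_path: $!";
    open my $output,  '>', $output_path  or die "Cannot write $output_path: $!";

    my %seen;
    while ( my $line = <$in> ) {
        # Lines with more delimiters than allowed go to the control file.
        my $count = () = $line =~ /\Q$delimiter\E/g;
        print {$control} $line if $count > $max_delims;

        # Keep only the first line for each prefix, in input order.
        my $prefix = substr $line, 0, $prefix_len;
        print {$output} $line unless $seen{$prefix}++;
    }

    close $in;
    close $control or die "Cannot close $control_path: $!";
    close $output  or die "Cannot close $output_path: $!";
    return;
}

# Usage: perl split_filter.pl file.in file.control file.output
filter_file(@ARGV) if @ARGV == 3;
```

The %seen hash still holds one key per distinct prefix, so the memory caveat above applies to this version as well.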
