Nari

Reputation: 53

How to extract non-matching lines from two text files?

I know how to extract matching lines using Perl, but I want the lines that do not match, i.e. the lines that are unique to one of the two text files.

file1:

one|E2027.1|073467|66   ATGCTATGTTTTGCTAAT  
one|E2002.1|073405|649  ATGAAAGCTTTAAAGAAA  
one|E2001.1|734704|201  ATGTTTTCAGGTATTATA  
one|E2025.1|073468|204  ATGAAACAGAAATATATT  
one|E2028.1|073431|578  ATGTTATTTAATTATGGT  
one|E2040.1|073743|862  ATGATTTATCCTAATAAT   

.........~2000 such lines

file2:

one|E2027.1|073467|66  
one|E5005.5|000005|005  
one|E2001.1|734704|201  
one|E2025.1|073468|204  
one|E2028.1|073431|578  
one|E2040.1|073743|862    

.........~2000 such lines

How can I extract the non-matching lines using Perl or cmd commands?
For example, line 2 of file2 is unique to file2.

Here's what I have so far

foreach(@2) {
    @org=split('\t',$_);
    chomp($two=$_);
    foreach(@1) {
        if($_=~m/^$two.+/) {
            print OUT1 "$_";
        } else {
            print OUT2 "$_";
        }
    }
}

But the else branch writes gigabytes of data to OUT2.

Upvotes: 1

Views: 1136

Answers (4)

Nari

Reputation: 53

I got this working, provided that the data to be compared is in a single column in both files:

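use strict;
use warnings;

print "Enter file1: ";
chomp(my $file = <STDIN>);
open(my $fh, '<', $file) or die "Cannot open $file: $!";

print "Enter file2: ";
chomp(my $hspfile = <STDIN>);
open(my $fh1, '<', $hspfile) or die "Cannot open $hspfile: $!";

my @list1 = <$fh1>;   # lines of file2
my @list2 = <$fh>;    # lines of file1

print "Enter output file1: ";
chomp(my $out = <STDIN>);
open(my $out_fh, '>', $out) or die "Cannot open $out: $!";

# Print every line of file1 that has no identical line in file2.
LIST2: foreach my $list2 (@list2) {
    foreach my $list1 (@list1) {
        next LIST2 if $list2 eq $list1;
    }
    print $out_fh $list2;
}
close $out_fh;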

Upvotes: 0

simbabque

Reputation: 54373

You have to read in one of the files first. Then you can match against the content of each line of the other file. I used first from List::Util to do that. grep is fine, too, but first stops after it finds the first occurrence, which saves you time with large files.

use strict;
use warnings;
use List::Util qw(first);
use 5.014;

my $file1 = <<"FILE1";
one|E2027.1|073467|66\tATGCTATGTTTTGCTAAT
one|E2002.1|073405|649\tATGAAAGCTTTAAAGAAA
one|E2001.1|734704|201\tATGTTTTCAGGTATTATA
one|E2025.1|073468|204\tATGAAACAGAAATATATT
one|E2028.1|073431|578\tATGTTATTTAATTATGGT
one|E2040.1|073743|862\tATGATTTATCCTAATAAT
FILE1

my $file2 = <<"FILE2";
one|E2027.1|073467|66
one|E5005.5|000005|005
one|E2001.1|734704|201
one|E2025.1|073468|204
one|E2028.1|073431|578
one|E2040.1|073743|862
FILE2

my @file1_content = map { (split(/\t/))[0] } split /\n/, $file1;

foreach my $line (split /\n/, $file2) {
  chomp $line; # a no-op here; needed once the lines come from a real filehandle instead of split
  next if first { $_ eq $line } @file1_content;
  say $line;
}

I strongly suggest you use strict and warnings in all your programs. They both help you to find small, subtle mistakes. It's also a good idea to name your variables in a more descriptive way. Arrays named @1 and @2 are very bad. I had trouble understanding which variable did what.
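
As a possible alternative to scanning the list with first for every line, the keys of one file can be stored in a hash, so that each lookup is a single hash access. This is only a sketch: the sub name keys_not_in is made up for illustration, and it assumes that the lines are already chomped, carry no trailing whitespace, and that the key is everything before the first tab, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the keys of @$candidates that do not
# occur among the keys of @$reference. The key is assumed to be the
# text before the first tab (the whole line if there is no tab).
sub keys_not_in {
    my ($candidates, $reference) = @_;

    # Remember every key of the reference file.
    my %seen;
    for my $line (@$reference) {
        my ($key) = split /\t/, $line;
        $seen{$key} = 1 if defined $key;
    }

    # Keep the candidate keys that were never seen.
    my @unique;
    for my $line (@$candidates) {
        my ($key) = split /\t/, $line;
        next unless defined $key;    # skip empty lines
        push @unique, $key unless $seen{$key};
    }
    return @unique;
}

# A few lines taken from the question's sample data.
my @file1 = (
    "one|E2027.1|073467|66\tATGCTATGTTTTGCTAAT",
    "one|E2002.1|073405|649\tATGAAAGCTTTAAAGAAA",
    "one|E2001.1|734704|201\tATGTTTTCAGGTATTATA",
);
my @file2 = (
    "one|E2027.1|073467|66",
    "one|E5005.5|000005|005",
    "one|E2001.1|734704|201",
);

print "only in file2: $_\n" for keys_not_in(\@file2, \@file1);
print "only in file1: $_\n" for keys_not_in(\@file1, \@file2);
```

With these six lines it reports one|E5005.5|000005|005 as only in file2 and one|E2002.1|073405|649 as only in file1. Calling the sub once per direction gives the lines unique to each file.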

Upvotes: 2

memowe

Reputation: 2668

Just to help you to improve your code:

foreach(@2) {
    @org=split('\t',$_);
    chomp($two=$_);
    foreach(@1) {
        if($_=~m/^$two.+/) {
            print OUT1 "$_";
        } else {
            print OUT2 "$_";
        }
    }
}

Do you know how often the body of the inner loop gets executed? scalar(@2) * scalar(@1) times, which is about 4 million (2000 × 2000) in your example. Every one of those iterations prints a line, which is why your files get that big. Replace the inner loop with:

$matched=0;
foreach(@1) {
    if($_=~m/^$two.+/) {
        $matched=1;
        last;
    }
}
if($matched) {
    print OUT1 $_;
} else {
    print OUT2 $_;
}

The inner loop now only keeps track of whether a match was found, and writing to the files happens once per line in the outer loop. Note that I tried to adapt to your coding style!

CODING STYLE! ARGH! :D

That coding style is so last millennium! Let me add some notes on how to make your code more secure, more readable and more debuggable:

  • always use strict; and use warnings;. Many errors can be found early that way.
  • don't use global (package) variables, which are far less tempting under strictures anyway. Use lexical variables (my @lines = ...).
  • use proper variable names: @1 isn't very helpful. In fact, accessing its single elements ($1[42]) looks very confusing since $1 is one of Perl's regex capture variables. It doesn't have to be very poetic. A simple @lines would work but even @gargravarr is better than @1.
  • don't use string interpolation when you don't need to. Acceptable use: "Hi $name, what's up?". Bad: print "$_". Just use print $_.
  • use white space. if($_=~m/^$two.+/) looks like line noise. For comparison, look at this hand-crafted epic piece of beautiful Perl code:
foreach my $line (@lines) {
    print $differences $line
        if $line =~ /^$prefix.*/;
}

So let's try to rewrite that code:

my $matched = 0;

foreach my $line (@lines) {
    if ($line =~ /^\Q$two\E.+/) {    # \Q...\E keeps | and . in $two literal
        $matched = 1;
        last;
    }
}

if ($matched) {
    print OUT1 $_;
} else {
    print OUT2 $_;
}

Feels so much better now! :) Know what you're doing! Don't just copy'n'paste code snippets.
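
One thing worth checking in the pattern /^$two.+/ from the question: the keys in the sample data contain | and ., which act as regex metacharacters when the variable is interpolated. The pattern then becomes an alternation and matches far more lines than intended. Here is a minimal sketch of the effect, and of quoting the key with \Q...\E, using two lines from the question (the variable names are made up for illustration):

```perl
use strict;
use warnings;

my $key  = 'one|E5005.5|000005|005';                      # only in file2
my $line = "one|E2027.1|073467|66\tATGCTATGTTTTGCTAAT";    # from file1

# Unquoted: '|' means alternation, so /^one/ alone is enough to match.
my $unquoted = $line =~ /^$key.+/     ? 1 : 0;

# Quoted: \Q...\E makes every character of $key literal.
my $quoted   = $line =~ /^\Q$key\E.+/ ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted\n";
```

This prints "unquoted: 1, quoted: 0": without quoting, a key that is not in file1 at all still "matches" every line that starts with one.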

Upvotes: 2

Guru

Reputation: 17004

#!/usr/bin/perl
use strict;
use warnings;

open my $fh1 ,'<', 'f1' or die $!;
open my $fh2 ,'<', 'f2' or die $!;
chomp(my @ar1=<$fh1>);
chomp(my @ar2=<$fh2>);
close $fh1;
close $fh2;

my @ar3=();
foreach my $x (@ar2) {
   push @ar3, $x if not grep (/^\Q$x\E/,@ar1);
}
print "$_\n" for @ar3;   # one unmatched line per output line

where f1 and f2 are your files.
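
Note that /^\Q$x\E/ is a prefix test, so a key ending in |66 would also count as present if f1 only contained a line whose key ends in |661. If that can happen in your data, one option is to require a tab or the end of the line right after the key. This is a sketch under that assumption: the sub name has_key and the |661 line are invented for illustration and do not come from the question's files.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of lines in @$lines that start
# with $key followed directly by a tab or by the end of the line.
sub has_key {
    my ($key, $lines) = @_;
    return scalar(grep { /^\Q$key\E(?:\t|$)/ } @$lines);
}

my @ar1 = ("one|E2027.1|073467|661\tATGCTATGTTTTGCTAAT");   # invented line
my $key = 'one|E2027.1|073467|66';

my $prefix_only = scalar(grep { /^\Q$key\E/ } @ar1);   # plain prefix test
my $with_bound  = has_key($key, \@ar1);                # key must end here

print "prefix only: $prefix_only, with boundary: $with_bound\n";
```

This prints "prefix only: 1, with boundary: 0", so only the bounded version reports the |66 key as missing from f1.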

Upvotes: 0
