Reputation: 53
I know how to extract matching lines with Perl, but I want the lines that do not match, i.e. the lines that are unique to one of two text files.
file1:
one|E2027.1|073467|66 ATGCTATGTTTTGCTAAT
one|E2002.1|073405|649 ATGAAAGCTTTAAAGAAA
one|E2001.1|734704|201 ATGTTTTCAGGTATTATA
one|E2025.1|073468|204 ATGAAACAGAAATATATT
one|E2028.1|073431|578 ATGTTATTTAATTATGGT
one|E2040.1|073743|862 ATGATTTATCCTAATAAT
.........~2000 such lines
file2:
one|E2027.1|073467|66
one|E5005.5|000005|005
one|E2001.1|734704|201
one|E2025.1|073468|204
one|E2028.1|073431|578
one|E2040.1|073743|862
.........~2000 such lines
How can I extract the non-matching lines using Perl or command-line tools?
For example, line 2 of file2 is unique to file2.
Here's what I have so far:
foreach(@2) {
@org=split('\t',$_);
chomp($two=$_);
foreach(@1) {
if($_=~m/^$two.+/) {
print OUT1 "$_";
} else {
print OUT2 "$_";
}
}
}
But the else branch writes gigabytes of output.
Upvotes: 1
Views: 1136
Reputation: 53
I got it working with the code below, provided that the data to be compared is in a single column in both files:
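use strict;
use warnings;

print "Enter file1: ";
chomp(my $file = <STDIN>);
open(my $fh, '<', $file) or die "Cannot open $file: $!";
print "Enter file2: ";
chomp(my $hspfile = <STDIN>);    # chomp, or the trailing newline ends up in the file name
open(my $fh1, '<', $hspfile) or die "Cannot open $hspfile: $!";
my @list1 = <$fh1>;    # lines of file2
my @list2 = <$fh>;     # lines of file1
print "Enter output file1: ";
chomp(my $out = <STDIN>);
open(my $out_fh, '>', $out) or die "Cannot open $out: $!";
# Print every line of file1 that has no identical line in file2.
LIST2: foreach my $list2 (@list2) {
    foreach my $list1 (@list1) {
        next LIST2 if $list2 eq $list1;
    }
    print {$out_fh} $list2;
}
close $out_fh;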
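The loop above only reports lines of one file that are missing from the other. Since the question asks for lines unique to either file, here is a minimal sketch that checks both directions with a hash instead of nested loops. The helper name only_in and the sample lists are made up for illustration, and it assumes whole-line (single-column) comparison as stated above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of @$from that do not occur in @$other.
sub only_in {
    my ($from, $other) = @_;
    my %seen = map { $_ => 1 } @$other;    # set of lines in the other list
    return grep { !$seen{$_} } @$from;     # keeps the order of @$from
}

# Illustrative sample data in the single-column format of file2.
my @list1 = ('one|E2027.1|073467|66', 'one|E5005.5|000005|005');
my @list2 = ('one|E2027.1|073467|66', 'one|E2001.1|734704|201');

my @only1 = only_in(\@list1, \@list2);
my @only2 = only_in(\@list2, \@list1);

print "only in list1: $_\n" for @only1;
print "only in list2: $_\n" for @only2;
```

With real files, the two arrays would be filled from chomped filehandle reads instead of literals.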
Upvotes: 0
Reputation: 54373
You have to read in one of the files first. Then you can match against the content of each line of the other file. I used first() from List::Util to do that. grep is fine, too, but first() stops after it finds the first occurrence, which saves you time with large files.
use strict;
use warnings;
use List::Util qw(first);
use 5.014;
my $file1 = <<"FILE1";
one|E2027.1|073467|66\tATGCTATGTTTTGCTAAT
one|E2002.1|073405|649\tATGAAAGCTTTAAAGAAA
one|E2001.1|734704|201\tATGTTTTCAGGTATTATA
one|E2025.1|073468|204\tATGAAACAGAAATATATT
one|E2028.1|073431|578\tATGTTATTTAATTATGGT
one|E2040.1|073743|862\tATGATTTATCCTAATAAT
FILE1
my $file2 = <<"FILE2";
one|E2027.1|073467|66
one|E5005.5|000005|005
one|E2001.1|734704|201
one|E2025.1|073468|204
one|E2028.1|073431|578
one|E2040.1|073743|862
FILE2
my @file1_content = map { (split(/\t/))[0] } split /\n/, $file1;
foreach my $line (split /\n/, $file2) {
chomp $line; # we need that because the split above is just a filler
next if first { $_ eq $line } @file1_content;
say $line;
}
I strongly suggest you use strict and warnings in all your programs. They both help you to find small, subtle mistakes. It's also a good idea to name your variables in a more descriptive way. Arrays named @1 and @2 are very bad. I had trouble understanding which variable did what.
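To illustrate the point about first stopping early while grep scans the whole list, here is a small sketch that counts how many comparisons each one performs. The list of keys and the counter names are invented for the demonstration.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Illustrative data: 1000 keys, with the target near the front.
my @keys   = map { "key$_" } 1 .. 1000;
my $target = 'key3';

my ($first_checks, $grep_checks) = (0, 0);

# first returns as soon as the block is true for one element.
my $found = first { $first_checks++; $_ eq $target } @keys;

# grep always evaluates the block for every element.
my @matches = grep { $grep_checks++; $_ eq $target } @keys;

print "first: $first_checks comparisons, grep: $grep_checks comparisons\n";
```

This prints "first: 3 comparisons, grep: 1000 comparisons". The saving depends on where the match sits, and a line with no match at all still costs a full scan either way.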
Upvotes: 2
Reputation: 2668
Just to help you to improve your code:
foreach(@2) {
@org=split('\t',$_);
chomp($two=$_);
foreach(@1) {
if($_=~m/^$two.+/) {
print OUT1 "$_";
} else {
print OUT2 "$_";
}
}
}
Do you know how often the code of the inner loop gets executed? scalar(@2) * scalar(@1) times, which is about 4 million in your example. That's the reason why your files get that big. Replace the inner loop with:
$matched=0;
foreach(@1) {
if($_=~m/^$two.+/) {
$matched=1;
last;
}
}
if($matched) {
print OUT1 $_;
} else {
print OUT2 $_;
}
The inner loop now only keeps track of matches, and writing to the files happens only in the outer loop. Note that I tried to adapt to your coding style!
That coding style is from the last millennium! Let me add some notes on how to make your code more secure, more readable and easier to debug:
- Always use strict; and use warnings;. Many errors can be found early that way.
- Under strict, variables must be declared. Use lexical variables (my @lines = ...).
- Use descriptive variable names. @1 isn't very helpful. In fact, using its single elements ($1[42]) looks very confusing, since $1 and friends are Perl's regex capture variables. The name doesn't have to be very poetic. A simple @lines would work, but even @gargravarr is better than @1.
- Only quote a string when you interpolate something into it, e.g. "Hi $name, what's up?". Bad: print "$_". Just use print $_.
- if($_=~m/^$two.+/) looks like line noise. For comparison, look at this hand-crafted epic piece of beautiful Perl code:
foreach my $line (@lines) {
    print $differences $line if $line =~ /^$prefix.*/;
}
So let's try to rewrite that code:
my $matched = 0;
foreach my $line (@lines) {
if ($line =~ /^\Q$two\E.+/) {  # \Q...\E keeps the | and . in $two literal
$matched=1;
last;
}
}
if ($matched) {
print OUT1 $_;
} else {
print OUT2 $_;
}
Feels so much better now! :) Know what you're doing! Don't just copy'n'paste code snippets.
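Even with last, the nested loops can still need up to scalar(@2) * scalar(@1) comparisons when most keys do not match. As an alternative to the loop above (not what this answer uses), here is a sketch that indexes file1 once in a hash and then does one lookup per line of file2. The variable names are made up, the tab between key and sequence is an assumption about the file format, and the comparison is exact key equality rather than the prefix regex.

```perl
use strict;
use warnings;

# Sample lines from the question; the tab separator is assumed.
my @file1_lines = (
    "one|E2027.1|073467|66\tATGCTATGTTTTGCTAAT",
    "one|E2001.1|734704|201\tATGTTTTCAGGTATTATA",
);
my @file2_lines = (
    'one|E2027.1|073467|66',
    'one|E5005.5|000005|005',
    'one|E2001.1|734704|201',
);

# Index file1 once by its first tab-separated field.
my %in_file1;
for my $line (@file1_lines) {
    my ($key) = split /\t/, $line;
    $in_file1{$key} = 1;
}

# One hash lookup per line of file2 instead of an inner loop.
my (@matched, @unmatched);
for my $key (@file2_lines) {
    if ($in_file1{$key}) { push @matched,   $key }
    else                 { push @unmatched, $key }
}

print "unmatched: $_\n" for @unmatched;
```

For the sample data this prints "unmatched: one|E5005.5|000005|005". With about 2000 lines per file this is roughly 4000 hash operations instead of millions of regex matches.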
Upvotes: 2
Reputation: 17004
#!/usr/bin/perl
use strict;
use warnings;
open my $fh1 ,'<', 'f1' or die $!;
open my $fh2 ,'<', 'f2' or die $!;
chomp(my @ar1=<$fh1>);
chomp(my @ar2=<$fh2>);
close $fh1;
close $fh2;
my @ar3=();
foreach my $x (@ar2) {
push @ar3, $x if not grep (/^\Q$x\E/,@ar1);
}
print "@ar3";
where f1 and f2 are your files.
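The \Q...\E in the grep pattern matters for this data, because the keys contain | and ., which are regex metacharacters. A short sketch of the difference, using one key and one line from the question (the tab in the file1 line is an assumption about the format):

```perl
use strict;
use warnings;

my $key  = 'one|E5005.5|000005|005';                      # unique to file2
my $line = "one|E2027.1|073467|66\tATGCTATGTTTTGCTAAT";   # a line from file1

# Unquoted: the pattern is read as "^one" OR "E5005.5" OR ..., so it matches.
my $unquoted = $line =~ /^$key/ ? 1 : 0;

# Quoted: the key is matched as literal text, so it does not match.
my $quoted = $line =~ /^\Q$key\E/ ? 1 : 0;

print "unquoted: $unquoted, quoted: $quoted\n";
```

This prints "unquoted: 1, quoted: 0". Without the quoting, every key would appear to match any file1 line that starts with "one".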
Upvotes: 0