Reputation: 2412
I have two files, target and clean.
I want to extract the lines of target whose first column matches a line in clean.
I wrote a grep-based loop to do this, but it's painfully slow. Speedups will be rewarded with upvotes and/or smileys.
clean="/path/to/clean"
target="/path/to/target"
oFile="/output/file"

head -1 "$target" > "$oFile"
while read -r snp; do
    echo "$snp"
    grep "^$snp" "$target" >> "$oFile"
done < "$clean"
$ head $clean
1_111_A_G
1_123_T_A
1_456_A_G
1_7892_C_G
Edit: I wrote a simple Python script to do it.
clean_variants_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/clean_variants"
allChr_file = "/scratch2/vyp-scratch2/cian/UCLex_August2014/allChr_snpStats"
outfile = open("/scratch2/vyp-scratch2/cian/UCLex_August2014/results.tab", "w")

clean_variant_dict = {}
for line in open(clean_variants_file):
    clean_variant_dict[line.strip()] = 0

for line in open(allChr_file):
    ll = line.strip().split("\t")
    id_ = ll[0]
    if id_ in clean_variant_dict:
        outfile.write(line)
outfile.close()
Upvotes: 0
Views: 160
Reputation: 34120
For fun, I thought I would convert a solution or two into Perl6.
Note: these are probably going to be slower than the originals until Rakudo/NQP gets more optimizations, which only began in earnest fairly recently as of this posting.
First is TLP's Perl5 answer converted nearly one-to-one into Perl6.
#! /usr/bin/env perl6
# I have a link named perl6 aliased to Rakudo on MoarVM-jit
use v6;

multi sub MAIN ( Str $clean, Str $target ){ # same as the Perl5 version
    MAIN( :$clean, :$target ); # call the named version
}

multi sub MAIN ( Str :$clean!, Str :$target! ){ # using named arguments
    note "Processing clean file";
    my %seen := SetHash.new;
    for open( $clean, :r ).lines -> $line {
        next unless $line.chars; # skip empty lines
        %seen{$line}++;
    }

    note "Processing target file";
    for open( $target, :r ).lines -> $line {
        $line ~~ /^ $<first> = <-[\t]>+ /;
        say $line if %seen{$<first>.Str};
    }
}
I used MAIN subroutines so that you get a Usage message if you don't give it the correct arguments. I also used a SetHash instead of a regular Hash to reduce memory use, since we don't need to know how many we have found, only that they were found.
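The membership-only idea behind SetHash carries over directly to other languages; as an illustration (my addition, not part of the original answer), here is the same trade-off in Python, where a plain set records that a key was seen without counting how often:

```python
# A set stores each key at most once, with no counts -- all the
# duplicate check needs is "was this key seen?".
seen = set()
for line in ["1_111_A_G", "1_123_T_A", "", "1_111_A_G"]:
    if line:  # skip empty lines, as the Perl6 version does
        seen.add(line)

print("1_111_A_G" in seen)  # True
print("9_999_G_C" in seen)  # False
```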
Next I tried to combine all of the lines in the clean file into one regex. This is similar to the sed and grep answer from Cyrus, except that instead of many regexes there is only one.

I didn't want to change the subroutine that I had already written, so I added one that is differentiated by adding --single-regex or -s to the command line. (All of the examples are in the same file.)
multi sub MAIN ( Str :$clean!, Str :$target!, Bool :single-regex(:s($))! ){
    note "Processing clean file";
    my $regex;
    {
        my @regex = open( $clean, :r ).lines.grep(*.chars);
        $regex = /^ [ | @regex ] /;
    } # throw away @regex

    note "Processing target file";
    for open( $target, :r ).lines -> $line {
        say $line if $line ~~ $regex;
    }
}
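The single-regex approach is not specific to Perl6; as a sketch in Python (my illustration with made-up sample keys, not part of the original answer), the clean keys can be escaped and joined into one anchored alternation:

```python
import re

keys = ["1_111_A_G", "1_123_T_A", "1_456_A_G"]  # sample keys for illustration
# One anchored regex: match any key at the start of the line, followed
# by a tab so partial-key prefixes are rejected.
pattern = re.compile(r"^(?:" + "|".join(map(re.escape, keys)) + r")\t")

print(bool(pattern.match("1_123_T_A\t0.97\tpass")))   # True
print(bool(pattern.match("1_123_T_AX\t0.97\tpass")))  # False
```

With many thousands of keys a single alternation can get slow to compile and match; a set lookup on the first column usually scales better.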
I will say that I took quite a bit longer to write this than it would have taken me to write it in Perl5. Most of the time was taken up searching for some idioms online, and looking over the source files for Rakudo. I don't think it would take much effort to get better at Perl6 than Perl5.
Upvotes: 0
Reputation: 50657
Using a perl one-liner:
perl -F'\t' -lane '
BEGIN{ local @ARGV = pop; @s{<>} = () }
print if exists $s{"$F[0]\n"}
' target clean
Switches:

-F : Alternate pattern for the -a switch
-l : Enable line-ending processing
-a : Splits the line on space and loads the fields into the array @F
-n : Creates a while(<>){...} loop for each "line" in your input file
-e : Tells perl to execute the code on the command line

Or as a perl script:
use strict;
use warnings;

die "Usage: $0 target clean\n" if @ARGV != 2;

my %s = do {
    local @ARGV = pop;
    map {$_ => 1} (<>)
};

while (<>) {
    my ($f) = split /\t/;
    print if $s{"$f\n"}
}
Upvotes: 2
Reputation: 67900
This Perl solution would use quite a lot of memory (because we load the entire file into memory), but would save you from looping twice. It uses a hash for duplicate checking, where each line is stored as a key. Note that this code is not thoroughly tested, but seems to work on a limited set of data.
use strict;
use warnings;

my ($clean, $target) = @ARGV;

open my $fh, "<", $clean or die "Cannot open file '$clean': $!";

my %seen;
while (<$fh>) {
    chomp;
    $seen{$_}++;
}

open $fh, "<", $target
    or die "Cannot open file '$target': $!"; # reuse file handle

while (<$fh>) {
    my ($first) = /^([^\t]*)/;
    print if $seen{$first};
}
If your target file is proper tab-separated CSV data, you could use Text::CSV_XS, which is reportedly very fast.
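As a cross-language aside (my addition, hedged): Python's standard csv module offers the same kind of robust tab-separated parsing, handling quoted fields that a naive split on tab would break:

```python
import csv
import io

# A quoted field containing literal tabs stays one field with csv.reader.
data = io.StringIO('1_111_A_G\t"a\tquoted\tfield"\t0.9\n')
row = next(csv.reader(data, delimiter='\t'))

print(row[0])    # 1_111_A_G
print(len(row))  # 3
```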
Upvotes: 3
Reputation: 77991
Python solution:
with open('/path/to/clean', 'r') as fin:
    keys = set(fin.read().splitlines())

with open('/path/to/target', 'r') as fin, open('/output/file', 'w') as fout:
    for line in fin:
        # split avoids the ValueError that line.index('\t') would raise
        # on a line without a tab
        if line.split('\t', 1)[0] in keys:
            fout.write(line)
Upvotes: 2