Reputation: 74390
The details of my query follow:
Lines ending in "\t\n" (that is, with an empty last field) are skipped, which is a trivial test and not the subject of this question. That will remove some 75% of the lines right off the bat, reducing the workload.
From each remaining line I need only fields 2,3,12-18,25-28,31.
One obvious option is the following simple code, which I've tried to format nicely and comment to show my reasoning:
use warnings;
use strict;
# I am using the latest stable version of Perl for this exercise
use 5.30.0;
while (<>)
{
    # Skip lines ending with an empty field
    next if substr($_,-2) eq "\t\n";
    # Remove "\n"
    chomp;
    # Split matching lines into fields on "\t", creating @fields
    my @fields=split(/\t/,$_);
    # Copy only the desired fields from @fields to create a new
    # line in TSV format
    # This can be done in one simple step in Perl, using
    # array slices and the join() function
    my $new_line=join("\t",@fields[2,3,12..18,25..28,31]);
    # ...
}
But using split leads to extra parsing (beyond the last field I need) and produces a complete array of fields, which I also don't need. I think it would be more efficient not to create the array, but to parse each line looking for tabs, counting the field indexes as I go, building the output line on the way, and stopping at the last field I need.
Am I correct in my assessment, or is a simple split, followed by a join of the slices containing the fields of interest, the best way to go here from a performance perspective?
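For concreteness, the manual scan described above could look roughly like the sketch below. The helper name extract_fields and the made-up test line are illustrative only; the indexes are 0-based and assumed to be in ascending order, as in the array slice. Whether this actually beats the built-in split (which is implemented in C) is not established here and would need measuring, for example with the core Benchmark module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the wanted fields (0-based, ascending indexes)
# of a chomped TSV line as a new tab-joined line, without calling split.
sub extract_fields {
    my ($line, @wanted) = @_;
    my @out;
    my $start = 0;    # offset where the current field begins
    my $idx   = 0;    # 0-based index of the current field
    for my $want (@wanted) {
        # Skip over the fields that are not needed
        while ($idx < $want) {
            my $tab = index($line, "\t", $start);
            return if $tab < 0;    # the line has too few fields
            $start = $tab + 1;
            $idx++;
        }
        # The wanted field runs up to the next tab (or the end of the line)
        my $end = index($line, "\t", $start);
        $end = length($line) if $end < 0;
        push @out, substr($line, $start, $end - $start);
    }
    # Scanning stops here: nothing after the last wanted field is examined
    return join("\t", @out);
}

# Made-up 41-field line: f0, f1, ..., f40
my $line = join("\t", map { "f$_" } 0 .. 40);
print extract_fields($line, 2, 3, 12 .. 18, 25 .. 28, 31), "\n";
```

The scan never looks past the end of field 31, which is the saving the question is after; the cost is a Perl-level loop in place of a single built-in call.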
Update: Unfortunately, no one mentioned the possibility of using GNU cut for the split and piping the results into Perl for the rest of the processing. This is probably the most performant way, short of writing lots of custom (C) code or resorting to large block-based reads with custom line parsing (also in C).
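A minimal sketch of that idea, assuming GNU grep and cut are on the PATH and that the file name contains no shell metacharacters. The helper name process_with_cut is hypothetical. Two caveats: the empty-last-field test has to run before cut, because cut discards the last field unless it is selected, so grep does it here; and cut numbers fields from 1 while Perl array indexes start at 0, so the field list may need shifting by one depending on which convention the numbers above follow.

```perl
use strict;
use warnings;

# Hypothetical helper: let grep and cut do the filtering and field selection,
# then hand each reduced line to a callback for the rest of the processing.
sub process_with_cut {
    my ($file, $handler) = @_;
    # grep -v drops lines ending in a tab (empty last field);
    # cut keeps only the wanted columns (1-based numbering).
    my $cmd = qq{grep -v -P '\\t\$' "$file" | cut -f2,3,12-18,25-28,31};
    open(my $in, '-|', $cmd) or die "cannot start pipeline: $!";
    while (my $new_line = <$in>) {
        chomp $new_line;
        $handler->($new_line);
    }
    close($in) or warn "pipeline exited with status $?\n";
    return;
}

# Usage (the file name is a placeholder):
# process_with_cut('yourFile.tsv', sub { print "$_[0]\n" });
```

Perl then only ever sees the lines and fields it needs, at the price of two extra processes and a pipe.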
Upvotes: 2
Views: 114
Reputation: 240
grep -P -v "\t\s*$" yourFile.tsv | cut -f2,3,12-18,25-28,31
You don't even have to write any Perl code for this.
Here, -P makes GNU grep treat the pattern as a Perl-compatible regular expression, which is what lets it understand \t and \s.
-v inverts the match, so only the non-matching lines are printed, which corresponds to your next if.
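BTW, if you have enough cores and memory, you might want to speed up the process by splitting the file, filtering the pieces in parallel, and merging the results:
split -n l/10 -d yourFile.tsv yourFile.tsv.
That will generate yourFile.tsv.00, ..., yourFile.tsv.09. (The l/10 form makes GNU split cut only at line boundaries; a plain -n 10 splits by bytes and can break a line in two.)
Thus, the whole code looks something like the block below:
system('split -n l/10 -d yourFile.tsv yourFile.tsv.') == 0 or die "split failed";
my @yourFiles = glob('yourFile.tsv.[0-9][0-9]');
foreach my $file (@yourFiles) {
    # q{} keeps Perl from interpolating \t, \s and $ in the grep pattern
    my $cmd = q{grep -P -v '\t\s*$' } . "$file | cut -f2,3,12-18,25-28,31 > $file.filtered";
    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) { exec('sh', '-c', $cmd) or die "exec failed: $!" }   # child runs one pipeline
}
1 while wait() != -1;   # wait for every child before merging
system('cat yourFile.tsv.[0-9][0-9].filtered > final.output.tsv') == 0 or die "cat failed";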
Upvotes: 0
Reputation: 98398
You can tell split when to stop with its limit parameter:
my @fields=split(/\t/,$_,33);
(Specify one more than the number of fields you actually want, because the last field it produces will contain the remainder of the line.)
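As a quick illustration of the limit, here is a small sketch on a made-up 41-field line (the f0..f40 field values are invented for the example). With a limit of 33, split stops after the 32nd tab, so elements 0 to 31 are ordinary fields and element 32 holds everything that was left unparsed.

```perl
use strict;
use warnings;

# Made-up 41-field line: f0, f1, ..., f40
my $line = join("\t", map { "f$_" } 0 .. 40);

# Limit 33: fields 0..31 are split out, element 32 is the unsplit remainder
my @fields = split(/\t/, $line, 33);

print scalar(@fields), "\n";    # 33 elements
print "$fields[31]\n";          # last field that was really parsed
print "$fields[32]\n";          # rest of the line, tabs still in place

# The slice from the question works unchanged
my $new_line = join("\t", @fields[2,3,12..18,25..28,31]);
print "$new_line\n";
```

The array is still built, but it never grows beyond 33 elements and nothing after field 31 is scanned for tabs.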
Upvotes: 5