Reputation: 1131
I have 3 files (links_file, my_links and my_queue) and I'm doing 3 things with the links_file: removing duplicate entries (the file is sorted by img_url), dropping lines whose img_url already appears in my_links, and dropping lines whose img_url already appears in my_queue; whatever is left goes to an output file.
I have working code, but it takes a long time (more than 10 minutes) for around 30,000 lines in links_file, 1,000 in the my_links file and 300 in the my_queue file.
function clean_file(){
    links_file="$1"
    my_links="$2"
    my_queue="$3"
    out_file="$4"

    rm -rf "$out_file"

    prev_url=""
    cat "$links_file" | while read line
    do
        img_url=$(echo $line | perl -pe 's/[ \t].*//g' | perl -pe 's/(.*)_.*/$1/g')

        # $links_file is sorted by img_url, so I can just check the previous value
        test "$prev_url" = "$img_url" && echo "duplicate: $img_url" && continue
        prev_url="$img_url"

        test $(grep "$img_url" "$my_links" | wc -l) -ne 0 && echo "in my_links: $img_url" && continue
        test $(grep "$img_url" "$my_queue" | wc -l) -ne 0 && echo "in my_queue: $img_url" && continue

        echo "$line" >> "$out_file"
    done
}
I'm trying to optimize the code, but I've run out of ideas. My knowledge of Perl is limited (I typically only use it for simple regular expression replacements). Any help optimizing this would be appreciated.
Upvotes: 0
Views: 135
Reputation: 17090
Let's do it step by step.
First, no need to call Perl twice. Instead of
img_url=$(echo $line | perl -pe 's/[ \t].*//g' | perl -pe 's/(.*)_.*/$1/g')
you can just do
img_url=$(echo $line | perl -pe 's/[ \t].*//g;s/(.*)_.*/$1/g')
But then, we can combine the two regexes into one:
s/.*_([^ \t]*).*/$1/
(capture the run of non-whitespace characters following the last underscore)
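With that single regex, the extraction becomes one call, something like:
img_url=$(echo $line | perl -pe 's/.*_([^ \t]*).*/$1/')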
Also, Perl is overkill where sed suffices:
img_url=$(echo $line | sed "s/.*_\([^ \t]*\).*/\1/")
But hey, maybe Perl should actually be your method of choice. You see, for every URL you read, you scan the two files (queue and links) in their entirety to find a matching line. If only there were a way of reading them once and keeping the inventory in memory! Oh wait. Yes, we could do it in bash. No, I would not like to do it :-)
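For the record, a bash version with the lookup tables kept in memory could look roughly like the sketch below (assuming bash 4 or later for associative arrays, the same variables as in your clean_file function, and the same img_url extraction rule as your original code):
declare -A skip
# load both filter files once, keyed by img_url
while read -r url _rest
do
    skip["${url%_*}"]=1
done < <(cat "$my_links" "$my_queue")

prev_url=""
while read -r line
do
    url=${line%%[[:space:]]*}   # cut at the first whitespace
    img_url=${url%_*}           # drop everything from the last underscore
    test "$prev_url" = "$img_url" && continue
    prev_url="$img_url"
    test -n "${skip[$img_url]}" && continue
    echo "$line"
done < "$links_file" > "$out_file"
Each check is then a single hash lookup instead of a grep over the whole file.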
The Perl script below is neither particularly complex nor optimized, but it should be way faster than your approach. And I tried to make it easy to understand; actually, above a certain level (and you are definitely on that level), Perl is much simpler to write than bash.
#!/usr/bin/perl
use strict ;
use warnings ;

my $my_links = "my_links" ;
my $my_queue = "my_queue" ;

# define the regular expression to find the img_url
my $regex = '.*_([^\s]*).*' ;

my %links = geturls( $my_links ) ;
my %queue = geturls( $my_queue ) ;

# loop over STDIN trying to find the match
my %index ;
while( <STDIN> ) {
    next unless m/$regex/ ;                             # ignore lines that do not match
    next if( $links{$1} || $queue{$1} || $index{$1} ) ; # skip known or already-seen urls
    $index{$1}++ ;                                      # index hash to eliminate duplicates
    print $_ ;
}

# function to store the two files (my_links and my_queue) in memory.
# we populate a hash with the img urls read.
sub geturls {
    my $fname = shift ;
    my %ret ;
    open my $fh, '<', $fname or die "Cannot open $fname: $!" ;
    while( <$fh> ) {
        next unless m/$regex/ ;   # ignore lines that do not match
        # $1 holds the subexpression within the parentheses
        $ret{$1}++ ;
    }
    return %ret ;
}
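The script reads the links on standard input and picks up my_links and my_queue from the current directory, so, assuming you save it as clean_links.pl (a name chosen here just for illustration), it could replace the whole clean_file function with:
perl clean_links.pl < links_file > out_file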
The script will remove any duplicates, even those not on consecutive lines -- hope you don't mind.
One caveat, though: I assumed that all files follow a similar structure. Please provide example files and desired output next time you ask a question here.
Upvotes: 1