jvilhena

Reputation: 1131

optimize multiple executions of grep

I have 3 files (links_file, my_links and my_queue) and I'm doing 3 things with the links_file:

1. skipping lines whose img_url is the same as the previous line's (links_file is sorted by img_url)
2. skipping lines whose img_url already appears in my_links
3. skipping lines whose img_url already appears in my_queue

Every line that survives is written to out_file.

I have working code, but it is taking a long time (more than 10 minutes) for around 30,000 lines in links_file, 1,000 in the my_links file and 300 in the my_queue file.

function clean_file(){
    links_file="$1"
    my_links="$2"
    my_queue="$3"
    out_file="$4"

    rm -rf "$out_file"
    prev_url=""
    cat "$links_file" | while read line
    do
        img_url=$(echo $line | perl -pe 's/[ \t].*//g' | perl -pe 's/(.*)_.*/$1/g')
        # $links_file is sorted by img_url, so i can just check the previous value
        test "$prev_url" = "$img_url" && echo "duplicate: $img_url" && continue
        prev_url="$img_url"
        test $(grep "$img_url" "$my_links" | wc -l) -ne 0 && echo "in my_links: $img_url" && continue
        test $(grep "$img_url" "$my_queue" | wc -l) -ne 0 && echo "in my_queue: $img_url" && continue
        echo "$line" >> "$out_file"
    done
}

I'm trying to optimize the code, but I have run out of ideas. My knowledge of Perl is limited (I typically only use it for simple regular expression replacements). Any help optimizing this would be appreciated.

Upvotes: 0

Views: 135

Answers (1)

January

Reputation: 17090

Let's do it step by step.

First, no need to call Perl twice. Instead of img_url=$(echo $line | perl -pe 's/[ \t].*//g' | perl -pe 's/(.*)_.*/$1/g'), you can just do

img_url=$(echo $line | perl -pe 's/[ \t].*//g;s/(.*)_.*/$1/g')

But then, we can combine the two regexes into one:

s/^([^ \t]*)_[^ \t]*.*/$1/

(capture the first whitespace-delimited field up to its last underscore, which is exactly what the two substitutions above do, one after the other)

Also, Perl is overkill where sed suffices:

img_url=$(echo $line | sed "s/^\([^ \t]*\)_[^ \t]*.*/\1/")

But hey, maybe Perl should actually be your method of choice. You see, for every URL you read, you scan the two files (queue and links) in their entirety looking for a matching line. If only there were a way to read them once and keep the inventory in memory! Oh wait. Yes, we could do it in bash. No, I would not like to do it :-)
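For the record, here is a rough sketch of what that in-memory lookup could look like in pure bash. It assumes bash 4+ (for associative arrays), the same variables as your clean_file function, and the same "first field up to its last underscore" key as your original code:

# build a lookup table of keys found in my_links and my_queue
declare -A known
while read -r line; do
    key=${line%%[[:space:]]*}    # drop everything from the first whitespace
    key=${key%_*}                # drop the last underscore and what follows
    [ -n "$key" ] && known[$key]=1
done < <(cat "$my_links" "$my_queue")

# one pass over links_file: skip consecutive duplicates and known keys
prev=""
while read -r line; do
    key=${line%%[[:space:]]*}
    key=${key%_*}
    [ "$key" = "$prev" ] && continue
    prev=$key
    [ -n "$key" ] && [ -n "${known[$key]}" ] && continue
    printf '%s\n' "$line"
done < "$links_file" > "$out_file"

The process substitution (< <(...)) keeps the first loop out of a subshell, so the known array is still populated when the second loop runs.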

The Perl script below is neither particularly complex nor optimized, but it should be way faster than your approach. I also tried to make it easy to understand; actually, above a certain level (and you are definitely at that level), Perl is much simpler to write than bash.

#!/usr/bin/perl

use strict   ;
use warnings ;

my $my_links = "my_links" ;
my $my_queue = "my_queue" ;
# define the regular expression that extracts the img_url:
# the first whitespace-delimited field, up to its last underscore
my $regex = '^([^\s]*)_[^\s]*' ;

my %links = geturls( $my_links ) ;
my %queue = geturls( $my_queue ) ;

# loop over STDIN trying to find the match

my %index ;
while( <STDIN> ) {
  next unless m/$regex/ ; # ignore lines that do not match
  next if( $links{$1} || $queue{$1} || $index{$1} ) ; 
  $index{$1}++ ; # index hash to eliminate duplicates
  print $_ ;
} 

# function to store the two files (my_links and my_queue) in the memory.
# we populate a hash with the img urls read.
sub geturls {

  my $fname = shift ;
  my %ret ; 

  open my $fh, '<', $fname or die "Cannot open $fname: $!" ;

  while( <$fh> ) {
    next unless m/$regex/  ; # ignore lines that do not match
    # $1 holds the subexpression within the parentheses
    $ret{$1}++ ; 
  } 

  return %ret ;
} 
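To use it, run the script from the directory that contains my_links and my_queue (the file name clean_links.pl below is just an example), feed links_file on standard input and redirect standard output to your out_file:

perl clean_links.pl < links_file > out_file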

The script will remove any duplicates, even those not on consecutive lines -- hope you don't mind.

One caveat, though: I assumed that all files follow a similar structure. Please provide example files and desired output next time you ask a question here.

Upvotes: 1
