user37774

Reputation: 3

comparing lines with awk vs while read line

I have two files, one with 17k lines and another with 4k lines. I want to compare positions 115 to 125 of each line in the first file with each line in the second file and, if there is a match, write the entire line from the first file to a new file. I came up with a solution that reads the file using 'cat $filename | while read LINE', but it takes around 8 minutes to complete. Is there another way, such as using 'awk', to reduce the processing time?

My code:

cat $filename | while read LINE
do
  #read 115 to 125 and then remove trailing spaces and leading zeroes
  vid=`echo "$LINE" | cut -c 115-125 | sed 's,^ *,,; s, *$,,' | sed 's/^[0]*//'`
  exist=0
  #match vid with entire line in id.txt
  exist=`grep -x "$vid" $file_dir/id.txt | wc -l`
  if [[ $exist -gt 0 ]]; then
    echo "$LINE" >> $dest_dir/id.txt
  fi
done

Upvotes: 0

Views: 717

Answers (1)

Chris Seymour

Reputation: 85865

How about this:

FNR==NR {                      # FNR == NR is only true in the first file

    s = substr($0,115,11)      # Columns 115-125 (11 characters), as in cut -c 115-125
    sub(/^ +/,"",s)            # Remove any leading spaces
    sub(/ +$/,"",s)            # Remove any trailing spaces
    sub(/^0+/,"",s)            # Remove any leading zeroes

    lines[s]=$0                # Store the line under its key (last duplicate wins)
    next                       # Get next line in first file
}
{                              # Now in second file
    for(i in lines)            # For each key in the array
        if (i~$0) {            # If the key matches the current line, used as a regex
            print lines[i]     # Print the matching line from file1
            next               # Get next line in second file
        }
}

Save it as script.awk and run it like this:

$ awk -f script.awk "$filename" "${file_dir}/id.txt" > "${dest_dir}/id.txt"

This will still be slow because, for each line in the second file, you need to look at roughly 50% of the unique keys from the first file on average (assuming most lines do in fact match). This can be improved significantly if you can confirm that the lines in the second file are full-line matches against the substrings.
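
To illustrate the distinction with made-up values (the key 1234 and the ID 23 are not from the question): a regex test with ~ succeeds on a partial match, while an array lookup with in requires the whole string to be equal.

```shell
# Illustrative values only; neither 1234 nor 23 comes from the question.
result=$(awk 'BEGIN {
    keys["1234"]                                   # one stored key
    for (k in keys)
        if (k ~ "23") print "regex: match"         # 23 occurs inside 1234
    if (!("23" in keys)) print "lookup: no match"  # 23 is not itself a key
}')
echo "$result"
```

The lookup is a single hash probe, which is why it avoids the loop over every stored key.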


For full line matches this should be faster:

FNR==NR {                      # FNR == NR is only true in the first file

    s = substr($0,115,11)      # Columns 115-125 (11 characters), as in cut -c 115-125
    sub(/^ +/,"",s)            # Remove any leading spaces
    sub(/ +$/,"",s)            # Remove any trailing spaces
    sub(/^0+/,"",s)            # Remove any leading zeroes

    lines[s]=$0                # Store the line under its key (last duplicate wins)
    next                       # Get next line in first file
}
($0 in lines) {                # Now in second file: exact key lookup
    print lines[$0]            # Print the matching line from file1
}
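
A possible variation, sketched here under a few assumptions: load the IDs first and then stream the data file, so that each line costs one hash lookup and lines sharing a key are all kept in their original order (the scripts above store one line per key, so a later duplicate overwrites an earlier one). The function name filter_by_id is hypothetical, the ID file is assumed to be non-empty with one bare ID per line, and the key handling mirrors the question's cut -c 115-125 followed by trimming spaces and leading zeroes.

```shell
# Hypothetical helper, not part of the original answer.
# Usage: filter_by_id IDS_FILE DATA_FILE > OUTPUT_FILE
filter_by_id() {
    awk '
        FNR == NR {                    # first file (IDs); assumed non-empty
            ids[$0]                    # remember each ID as an array key
            next
        }
        {                              # second file (fixed-width data)
            key = substr($0, 115, 11)  # columns 115-125, as in cut -c 115-125
            gsub(/^ +| +$/, "", key)   # trim leading and trailing spaces
            sub(/^0+/, "", key)        # strip leading zeroes
            if (key in ids) print      # exact lookup, input order preserved
        }
    ' "$1" "$2"
}
```

It could then be called along these lines:

$ filter_by_id "${file_dir}/id.txt" "$filename" > "${dest_dir}/id.txt"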

Upvotes: 2
