Don Richards

Reputation: 61

Most Efficient way to find pairs between 2 arrays in bash

I have 2 large arrays of hash values. I'm trying to find the best way to verify that every hash value in array_a also appears in array_b. The best I've got so far is:

  1. Import the Hash files into an array
  2. Sort each array
  3. For loop through array_a
  4. Inside of array_a's for loop, run another for loop over array_b (seems inefficient).
  5. If found, unset that value in array_b
  6. Set "found" to 1 and break the loop
  7. If an array_a value has no match, output it to a file

I have large images that I need to verify have been uploaded to the site, with matching hash values. I've created a list of hash values from the original files, and scraped the website copies to create a second list. I'm trying to keep this as vanilla as possible, so I'm only using typical bash functionality.

#!/bin/bash

array_a=($(< original_sha_values.txt))
array_b=($(< sha_values_after_downloaded.txt))

# Sort to speed up.
IFS=$'\n' array_a_sorted=($(sort <<<"${array_a[*]}"))
unset IFS
IFS=$'\n' array_b_sorted=($(sort <<<"${array_b[*]}"))
unset IFS

for item1 in "${array_a_sorted[@]}" ; do
  found=0

  for item2 in "${!array_b_sorted[@]}" ; do
    if [[ $item1 == ${array_b_sorted[$item2]} ]]; then
      unset 'array_b_sorted[item2]'
      found=1
      break
    fi
  done

  if [[ $found == 0 ]]; then
    echo "$item1" >> hash_is_missing_a_match.log
  fi
done

Sorting sped it up a lot:

IFS=$'\n' array_a_sorted=($(sort <<<"${array_a[*]}"))
unset IFS
IFS=$'\n' array_b_sorted=($(sort <<<"${array_b[*]}"))
unset IFS

Is this really the best way of doing this?

for item1 in "${array_a_sorted[@]}" ; do
 ...
  for item2 in "${!array_b_sorted[@]}" ; do
    if ...
       unset 'array_b_sorted[item2]'
       break

Both arrays have 12,000 lines of 64-bit hashes, and the comparison takes 20+ minutes. Is there a way to improve the speed?

Upvotes: 1

Views: 188

Answers (2)

Mike Holt

Reputation: 4612

I think karakfa's answer is probably the best approach if you just want to get it done and not worry about optimizing bash code.

However, if you still want to do it in bash, and you are willing to use some bash-specific features, you could shave off a lot of time using an associative array instead of two regular arrays:

# Read the original hash values into a bash associative array

declare -A original_hashes=()
while read -r hash; do
  original_hashes["$hash"]=1
done < original_sha_values.txt

# Then read the downloaded values and check each one to see if it exists
# in the associative array. Lookup time *should* be O(1)

while read -r hash; do
  if [[ -z "${original_hashes["$hash"]+x}" ]]; then
    echo "$hash" >> hash_is_missing_a_match.log
  fi
done < sha_values_after_downloaded.txt

This should be a lot faster than the nested loop implementation using regular arrays. Also, I didn't need any sorting, and all of the insertions and lookups on the associative array should be O(1), assuming bash implements associative arrays as hash tables. I couldn't find anything authoritative to back that up though, so take that with a grain of salt. Either way, it should still be faster than the nested loop method.

If you want the output sorted, you can just change the last line to:

done < <(sort sha_values_after_downloaded.txt)

in which case you're still only having to sort one file, not two.
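As a quick sanity check, the lookup behaves like this on tiny hypothetical sample files (stand-ins for the real hash lists; the echo replaces the log file for demonstration):

```shell
# Hypothetical sample data (stand-ins for the real hash files)
printf 'aaa\nbbb\nccc\n' > original_sha_values.txt
printf 'aaa\nccc\nddd\n' > sha_values_after_downloaded.txt

# Build the associative array of original hashes
declare -A original_hashes=()
while read -r hash; do
  original_hashes["$hash"]=1
done < original_sha_values.txt

# Report downloaded hashes that have no match among the originals
while read -r hash; do
  [[ -z "${original_hashes["$hash"]+x}" ]] && echo "$hash"
done < sha_values_after_downloaded.txt
# prints: ddd
```
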

Upvotes: 1

karakfa

Reputation: 67547

You're doing it the hard way.

If the task is to find the entries in file1 that are not in file2, here is a shorter approach:

$ comm -23 <(sort f1) <(sort f2)
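For example, with hypothetical sample files standing in for the two hash lists (`-2` suppresses lines unique to the second file, `-3` suppresses lines common to both, leaving only lines present in f1 but missing from f2):

```shell
# Hypothetical sample files standing in for the two hash lists
printf 'aaa\nbbb\nccc\n' > f1
printf 'aaa\nccc\n' > f2

# Lines in f1 that are not in f2
comm -23 <(sort f1) <(sort f2)
# prints: bbb
```
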

Upvotes: 3
