Bash script to efficiently return two file names that both contain a string found in a list

Question

I'm trying to find duplicates of a string ID across files. Each ID is unique and should appear in only one file. I want to verify that each ID is used only once; the script should report any duplicated ID and the files it appears in.

This is an example of the set.csv file

"Read-only",,"T","ID6776","3.1.1","Text","?"
"Read-only",,"T","ID4294","3.1.1.1","Text","?"
"Read-only","ID","T","ID7294","a )","Text","?"
"Read-only","ID","F","ID8641","b )","Text","?"
"Read-only","ID","F","ID8642","c )","Text","?"
"Read-only","ID","T","ID9209","d )","Text","?"
"Read-only","ID","F","ID3759","3.1.1.2","Text","?"
"Read-only",,"F","ID2156","3.1.1.3","

This is the very inefficient code I wrote

for ID in $(grep 'ID","[TF]' set.csv | cut -c 23-31); do
  for FILE1 in *.txt; do
    for FILE2 in *.txt; do
      # -nt keeps each pair of files from being reported twice
      if [[ $FILE1 -nt $FILE2 ]] && grep -qF -- "$ID" "$FILE1" && grep -qF -- "$ID" "$FILE2"; then
        echo "$ID + $FILE1 + $FILE2"
      fi
    done
  done
done

Essentially, I'm only interested in IDs that are marked "ID" in the CSV, which would be 7294, 8641, 8642, 9209 and 3759 but not the others. If File1 and File2 both contain the same ID from this set, the script should print the duplicated ID and each file it is found in.

There might be thousands of IDs and files, so my nested-loop approach, which checks every ID against every pair of files, isn't at all preferred. If Bash isn't up to it I'll move to sets, hashmaps and a logarithmic search algorithm in another language, but if the shell can do it I'd like to know how.

Thanks!

Edit: A bonus would be to find which IDs from set.csv aren't used at all. Pseudocode for another language might be: create a set of all the IDs in the CSV, build a second set from the IDs found in the files, then compare the two sets. Can Bash accomplish something like this?

Jeff Schaller · Accepted Answer

A linear option would be to use awk to store discovered identifiers with their corresponding filename, then report when an identifier is found again. Assuming the *.txt files use the same quoted, comma-separated layout as the set.csv sample:

awk -F, '$2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
  id=substr($4,4,4)
  if(ids[id]) {
    print id " is in " ids[id] " and " FILENAME;
  } else {
    ids[id]=FILENAME;
  }
}' *.txt

The awk script looks through every *.txt file; it splits the fields based on commas (-F,). If field 2 is "ID" and field 3 is "T" or "F", then it extracts the numeric ID from field 4. If that ID has been seen before, it reports the previous file and the current filename; otherwise, it saves the id with an association to the current filename.
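For the bonus question (IDs from set.csv that are never used), the same associative-array idea can serve as the two sets. The following is only a sketch under assumptions: unused_ids is a hypothetical helper name, the data files are assumed to share the quoted, comma-separated layout of set.csv, and IDs are assumed to be four digits, as in the substr call above.

```shell
# Sketch: print IDs that are listed in the CSV but appear in none of the data files.
# unused_ids is a hypothetical helper; usage: unused_ids set.csv *.txt
unused_ids() {
  csv=$1; shift
  awk -F, '
    # Rows of interest: field 2 is "ID" and field 3 is "T" or "F".
    $2 == "\"ID\"" && ($3 == "\"T\"" || $3 == "\"F\"") {
      id = substr($4, 4, 4)            # "ID7294" (quotes included) -> 7294
      if (FILENAME == ARGV[1]) {
        wanted[id] = 1                 # set 1: IDs listed in the CSV
      } else {
        used[id] = 1                   # set 2: IDs seen in the data files
      }
    }
    END {
      # Set difference: wanted minus used.
      for (id in wanted)
        if (!(id in used)) print id
    }
  ' "$csv" "$@" | sort
}
```

Calling unused_ids set.csv *.txt reads every file once, so the work stays linear in the total input size; the trailing sort is only there to make the output order predictable.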
