user8217644
user8217644

Reputation: 45

Is there a faster way to look for the contents of one file in another?

I have two files

File f1 (has around 10000 unique lines only in one column)

Column1  
line1  
line2  
line3  
....  
line 10000

File f2 (has multiple columns and around 300000 lines)
The first column of file f2 has common elements with file f1. I want to grep those common elements (all 10000) along with the contents of other columns in file2.

enter image description here

So far, I have tried

grep -f f1 f2  
and also 
grep -F -f f1 f2 

but both of these are giving me some extra lines in the final output (10000+). The first column of both the files have some contents separated by '/' which might need more text manipulations eg. Column1
a/b/c
e/f/g
x/y

Upvotes: 0

Views: 110

Answers (1)

Socowi
Socowi

Reputation: 27225

grep -f is a good start, however, there are two problems:

  1. Depending on the data it could be troublesome to ensure, that exactly column 1 from file 2 is matched (not only parts of it and no other columns). Examples:

    • For the files f1 = line1 and f2 = line100 a b you get a false positive since grep finds the string line1 in line100. This could be prevented with grep's -w option.
    • For the files f1 = line1 and f2 = line2 a line1 you get a false positive since grep finds the string line1 in the third column which shouldn't be searched at all. Errors like these are hard to prevent using grep. A safe way would be to generate extended regex patterns (something like grep -Ef <(sed 's/.*/^& /' f1) f2, requires additional effort to quote the lines from f1) but that would be complicated and slow.
  2. The command could be more efficient. Depending on the implementation grep -f may check all n lines from file 1 for every of the m lines in file 2. In the worst case there would be O(n*m) line operations.

The following command may run faster. Also, there won't be false positives. This is a simple version for files without headers:

join <(sort f1) (sort f2)

and this is the version for files with headers

hsort() { IFS= read -r header; printf %s\\n "$header"; sort; }
join --header <(hsort < f1) <(hsort < f2)

I expect this to do at most O(m log m) line operations.

O(sorting f1 + sorting f2 + joining)
= O(n log n + m log m + max(n,m)) | n < m
= O(2 m log m + m)
= O(m log m)

Upvotes: 1

Related Questions