Reputation: 45
I have two files
File f1 (has around 10000 unique lines only in one column)
Column1
line1
line2
line3
....
line 10000
File f2 (has multiple columns and around 300000 lines)
The first column of file f2 has common elements with file f1. I want to grep those common elements (all 10000) along with the contents of other columns in file2.
So far, I have tried
grep -f f1 f2
and also
grep -F -f f1 f2
but both of these are giving me some extra lines in the final output (10000+). The first column of both the files have some contents separated by '/' which might need more text manipulations
eg.
Column1
a/b/c
e/f/g
x/y
Upvotes: 0
Views: 110
Reputation: 27225
grep -f
is a good start, however, there are two problems:
Depending on the data it could be troublesome to ensure, that exactly column 1 from file 2 is matched (not only parts of it and no other columns). Examples:
f1
= line1
and f2
= line100 a b
you get a false positive since grep finds the string line1
in line100
. This could be prevented with grep's -w
option. f1
= line1
and f2
= line2 a line1
you get a false positive since grep finds the string line1
in the third column which shouldn't be searched at all. Errors like these are hard to prevent using grep
. A safe way would be to generate extended regex patterns (something like grep -Ef <(sed 's/.*/^& /' f1) f2
, requires additional effort to quote the lines from f1
) but that would be complicated and slow.The command could be more efficient. Depending on the implementation grep -f
may check all n
lines from file 1 for every of the m
lines in file 2.
In the worst case there would be O(n*m)
line operations.
The following command may run faster. Also, there won't be false positives. This is a simple version for files without headers:
join <(sort f1) (sort f2)
and this is the version for files with headers
hsort() { IFS= read -r header; printf %s\\n "$header"; sort; }
join --header <(hsort < f1) <(hsort < f2)
I expect this to do at most O(m log m)
line operations.
O(sorting f1 + sorting f2 + joining)
= O(n log n + m log m + max(n,m)) | n < m
= O(2 m log m + m)
= O(m log m)
Upvotes: 1