Reputation: 1201
I have a big data file called fileA in the following format:
col1 0.1111,0.2222,0.33333,0.4444
col5 0.1111,0.2222,0.33333,0.4444
col3 0.1111,0.2222,0.33333,0.4444
col4 0.1111,0.2222,0.33333,0.4444
The separator between the 1st and 2nd columns is \t; the other separators are commas. I have another file containing the names of the rows I am interested in, called fileB, which looks like:
col3
col1
...
Neither file is sorted. I want to retrieve all the rows from fileA whose names appear in fileB. The command
grep -f fileB fileA
does the job, but I think it searches all fields in fileA, which takes a long time. How can I restrict the search to the 1st column of fileA?
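I suppose I could anchor each name to the start of the line, something like the sketch below (untested; it assumes GNU sed, so that \t in the replacement becomes a real tab, and that the names in fileB contain no regex metacharacters), but I am not sure this is the right way:
grep -f <(sed 's/.*/^&\t/' fileB) fileA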
Upvotes: 0
Views: 383
Reputation: 2762
join <(sort -t $'\t' -k 1 fileA) <(sort -t $'\t' -k 1 fileB)
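An equivalent invocation that makes the join field and the tab separator explicit (just a more verbose spelling of the same command; join's default blank splitting already handles the tab):
join -t $'\t' -1 1 -2 1 <(sort -t $'\t' -k1,1 fileA) <(sort -t $'\t' -k1,1 fileB)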
The files are sorted in O(n log n + p log p), then merged in O(n + p); I don't think we can do better than that.
EDIT: OK, we can do better with a hash table, which will be O(n + p).
Upvotes: 1
Reputation: 195029
A linear-time O(n) solution without sorting (I didn't test it, so I hope there are no typos):
awk -F'\t' 'NR==FNR{a[$0]=7;next}a[$1]' fileB fileA
Note that the get operation on a hash table is considered O(1).
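For readability, the same awk program spelled out with comments:
awk -F'\t' '
    NR==FNR {      # first file (fileB): remember every wanted row name
        a[$0] = 7  # any non-empty, non-zero value works as a marker
        next
    }
    a[$1]          # second file (fileA): print the line if its 1st field was seen in fileB
' fileB fileA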
Upvotes: 0