Reputation: 11

Find matching IDs in two big files

I have 2 big files.

file1 has 160 million lines with this format: id:email

file2 has 45 million lines with this format: id:hash

I need to find every id that appears in both files and write the matches to a third file, one per line, in the format: email:hash

I tried something like:

awk -F':' 'NR==FNR{a[$1]=$2;next} {print a[$1]":"$2}' test1.in test2.in > res.in

But it's not working :(

Example file1:

9305718:[email protected] 
59287478:[email protected]

file2:

21367509:e90100b1b668142ad33e58c17a614696ec04474c
9305718:d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e

Desired result:

[email protected]:d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e

Upvotes: 1

Answers (2)

Reputation: 37424

In awk (not considering the amount of resources you have available, since the whole first file is held in an in-memory array):

$ awk -F':' 'NR==FNR{a[$1]=$2;next} ($1 in a) {print a[$1]":"$2}' test1.in test2.in
[email protected] :d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e
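
Since memory is the main concern with an array lookup, one possible variation is to load the smaller file (the 45-million-line id:hash file) into the array and stream the larger one past it. This is only a sketch: the file names file1, file2 and res.out and the sample addresses are placeholders I made up, and whether 45 million entries fit in RAM on your machine is something you would have to check.

```shell
# Hypothetical sample data in the question's format (placeholder addresses).
printf '%s\n' '9305718:alice@example.com' '59287478:bob@example.com' > file1
printf '%s\n' '21367509:e90100b1b668142ad33e58c17a614696ec04474c' \
              '9305718:d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e' > file2

# First pass (NR==FNR): remember hash by id from the smaller file.
# Second pass: for each id:email line, print email:hash only if the id is known.
# "$1 in h" tests membership without creating an empty array entry.
awk -F':' 'NR==FNR { h[$1] = $2; next } ($1 in h) { print $2 ":" h[$1] }' file2 file1 > res.out

cat res.out
# prints: alice@example.com:d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e
```

Note the swapped argument order (file2 first), which is why the print statement puts $2 of the current line before the stored hash.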

Upvotes: 0

Reputation: 88731

With GNU join and GNU bash:

join -t : -j 1 <(sort -t : -k1,1 file1) <(sort -t : -k1,1 file2) -o 1.2,2.2

Update:

join -t: <(sort file1) <(sort file2) -o 1.2,2.2
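
One thing to watch with either form: join expects both inputs to be sorted on the join field, using the same collation that join itself uses. A rough sketch that makes this explicit by forcing the C locale and sorting on the id field only, writing the sorted copies to intermediate files instead of using process substitution. The file names and sample addresses are placeholders of mine, not from the question.

```shell
# Hypothetical sample data in the question's format (placeholder addresses).
printf '%s\n' '9305718:alice@example.com' '59287478:bob@example.com' > file1
printf '%s\n' '21367509:e90100b1b668142ad33e58c17a614696ec04474c' \
              '9305718:d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e' > file2

# Sort both inputs on the id field only, bytewise (C locale),
# so sort and join agree on the ordering of the keys.
LC_ALL=C sort -t : -k1,1 file1 > file1.sorted
LC_ALL=C sort -t : -k1,1 file2 > file2.sorted

# Join on field 1 (the default) and print field 2 of each file: email:hash.
LC_ALL=C join -t : -o 1.2,2.2 file1.sorted file2.sorted > joined.out

cat joined.out
# prints: alice@example.com:d63fff1d21e1a04c066824dd2f83f3aeaa0edf6e
```

sort works from temporary files on disk when the input does not fit in memory, so this route should be less RAM-hungry than an in-memory array, at the cost of sorting both files first.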

Upvotes: 1