I have two TXT files: 1.txt has 11,000 IPs and 2.txt has 1 million IPs. I want to match 1.txt against 2.txt (the 1 million IPs) and get the matching entries.
#1.txt
1,1.1.1.1
2,2.2.2.2
3,3.3.3.3
.........
#2.txt
51.51.6.10
12.10.25.16
1.3.50.55
0.0.0.0
6.6.6.6
1.1.1.1
2.2.2.2
5.5.5.5
6.6.6.6
7.7.7.7
20.200.100.30
Likewise, 1 million lines of IPs.......
Matching result:
1,1.1.1.1
2,2.2.2.2
I tried awk -F, 'NR==FNR{a[$0];next}($2 in a)' 2.txt 1.txt. It gives the exact answer on a smaller subset (test runs), but against the original files, 11,000 IPs checked against 1 million, it returns every IP in 1.txt.
I tried sed -n -f <(sed 's|.*|/,&$/p|' 2.txt) 1.txt, but the process gets killed.
I tried comm -23 1.txt 2.txt > 3.txt, which again returns all the IPs from 1.txt.
I'm not sure where I'm making a mistake, or whether matching against 1 million IPs is even possible with sed, awk, comm, or any other tool. Can someone suggest what the issue might be?
Reference used: http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file
Assumption #1: the files are sorted as shown in your original question.
Assumption #2: IP addresses are unique.
If you want just the IP addresses:
$ comm -12 <(cut -d, -f2 1.txt) 2.txt
1.1.1.1
2.2.2.2
If you want the whole line in 1.txt:
$ comm -12 <(cut -d, -f2 1.txt) 2.txt | while read -r ip ; do grep ",${ip}$" 1.txt ; done
1,1.1.1.1
2,2.2.2.2
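As a side note, the grep-per-IP loop rescans 1.txt once per matching address; a single awk pass does the same lookup in one go. This is essentially your own awk attempt with a defensive strip of trailing carriage returns added, in case the full-size files have Windows line endings (just a guess, since the command worked on your test subset):
$ awk -F, '{sub(/\r$/,"")} NR==FNR{ips[$0];next} $2 in ips' 2.txt 1.txt
1,1.1.1.1
2,2.2.2.2
Here sub(/\r$/,"") drops a trailing CR and re-splits the fields, NR==FNR{ips[$0];next} loads every IP from 2.txt into an array, and $2 in ips prints each line of 1.txt whose IP was seen.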
UPDATE
If my Assumption #1 does not hold, then you have to sort 1.txt and 2.txt on the fly.
This is the solution to get just the common IP addresses:
$ comm -12 <(cut -d, -f2 1.txt | sort) <(sort 2.txt)
1.1.1.1
2.2.2.2
and this will show the full line from 1.txt:
$ comm -12 <(cut -d, -f2 1.txt | sort) <(sort 2.txt) | while read -r ip ; do grep ",${ip}$" 1.txt ; done
1,1.1.1.1
2,2.2.2.2
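For completeness, join(1) can do this lookup directly once both inputs are sorted on the join field; pinning LC_ALL=C keeps sort and join agreeing on collation (a sketch, not benchmarked at the 1-million-line scale):
$ LC_ALL=C join -t, -1 2 -2 1 -o 1.1,1.2 <(LC_ALL=C sort -t, -k2,2 1.txt) <(LC_ALL=C sort 2.txt)
1,1.1.1.1
2,2.2.2.2
Here -1 2 -2 1 joins field 2 of 1.txt against field 1 (the whole line) of 2.txt, and -o 1.1,1.2 prints the index and IP from 1.txt in their original order.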
I also ran a quick test on my small MacBook Air with 1 million IPs in 1.txt and 0.5 million IPs in 2.txt. It takes 19 seconds when the files have to be sorted.
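If you want to reproduce a similar test, input of the right shape can be generated along these lines (a sketch; the addresses are random, so your match counts and timings will differ):
$ awk 'BEGIN{srand(); for(i=1;i<=1000000;i++) printf "%d.%d.%d.%d\n", int(rand()*256), int(rand()*256), int(rand()*256), int(rand()*256)}' > 2.txt
$ head -n 11000 2.txt | awk '{print NR","$0}' > 1.txt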