Arun

Reputation: 1220

Bash issue: compare two large text files and get the matching IP addresses

I have two text files: 1.txt has 11,000 IPs and 2.txt has 1 million IPs. I want to match 1.txt against 2.txt and get the IPs that appear in both.

#1.txt
1,1.1.1.1
2,2.2.2.2
3,3.3.3.3
.........

#2.txt
51.51.6.10
12.10.25.16
1.3.50.55
0.0.0.0
6.6.6.6
1.1.1.1
2.2.2.2
5.5.5.5
6.6.6.6
7.7.7.7
20.200.100.30
... and likewise for 1 million lines of IPs ...

Expected matching result:
1,1.1.1.1
2,2.2.2.2

  1. I tried awk -F, 'NR==FNR{a[$0];next}($2 in a)' 2.txt 1.txt. It gives the exact answer for a smaller subset (test runs), but against the original files (11,000 IPs against 1 million) it returns every IP in 1.txt.

  2. I tried sed -n -f <(sed 's|.*|/,&$/p|' 2.txt) 1.txt. The process is killed automatically.

  3. I tried comm -23 1.txt 2.txt > 3.txt. Again it returns every IP in 1.txt.

I'm not sure where I'm making a mistake, or whether matching against 1 million IPs is simply not possible with sed, awk, comm, or similar tools. Can someone help me work out what the issue is?

Reference used: http://stackoverflow.com/questions/4366533/remove-lines-from-file-which-appear-in-another-file

Upvotes: 1

Views: 177

Answers (1)

mauro

Reputation: 5950

Assumption #1: the files are sorted, as shown in your original question

Assumption #2: the IP addresses are unique
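
Both assumptions can be checked before running comm, which expects its inputs in sort order. A minimal sketch, where is_sorted and has_duplicates are hypothetical helper names and not part of any tool:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the two assumptions above.

# is_sorted FILE: succeeds if FILE is already in sort(1) order,
# which is the order comm expects.
is_sorted() {
    sort -c "$1" 2>/dev/null
}

# has_duplicates FILE: succeeds if any line appears more than once.
has_duplicates() {
    [ -n "$(sort "$1" | uniq -d | head -n 1)" ]
}

# Usage:
#   is_sorted 2.txt      || echo "2.txt is not sorted"
#   has_duplicates 2.txt && echo "2.txt has repeated lines"
```

On the 2.txt sample from the question both checks would report a problem: 51.51.6.10 comes before 12.10.25.16, and 6.6.6.6 is listed twice.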

If you want just the IP addresses:

$ comm -12 <(cut -d, -f2 1.txt) 2.txt 
1.1.1.1
2.2.2.2
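
For reference, comm prints three columns (lines only in the first file, lines only in the second, lines in both), and each digit in the option hides one column. That is why -12 gives the common lines while the -23 used in the question gives the lines found only in the first file; since 1.txt lines look like "1,1.1.1.1", none of them equals a line of 2.txt and all of 1.txt comes back. A small sketch with made-up sample files:

```shell
#!/usr/bin/env bash
# Two tiny sorted inputs, made up for illustration.
dir=$(mktemp -d)
printf '1.1.1.1\n2.2.2.2\n3.3.3.3\n' > "$dir/a.txt"
printf '1.1.1.1\n2.2.2.2\n5.5.5.5\n' > "$dir/b.txt"

# -12 hides columns 1 and 2, leaving column 3: lines present in both files.
common=$(comm -12 "$dir/a.txt" "$dir/b.txt")

# -23 hides columns 2 and 3, leaving column 1: lines only in the first file.
only_first=$(comm -23 "$dir/a.txt" "$dir/b.txt")

printf 'common:\n%s\n' "$common"
printf 'only in first file:\n%s\n' "$only_first"
rm -rf "$dir"
```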

If you want the whole line in 1.txt:

$ comm -12 <(cut -d, -f2 1.txt) 2.txt | while read -r ip ; do grep -F ",$ip" 1.txt ; done
1,1.1.1.1
2,2.2.2.2

UPDATE

If my Assumption #1 is not valid, then you have to sort 1.txt and 2.txt on the fly.

This is the solution to get just common IP addresses:

$ comm -12 <(cut -d, -f2 1.txt |sort) <(sort 2.txt) 
1.1.1.1
2.2.2.2

and this will show the full line from 1.txt:

$ comm -12 <(cut -d, -f2 1.txt | sort) <(sort 2.txt) | while read -r ip ; do grep -F ",$ip" 1.txt ; done
1,1.1.1.1
2,2.2.2.2
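
The loop above starts one grep per common IP and matches the IP as a substring, so 1.1.1.1 would also select a line ending in 1.1.1.10. A possible alternative is a single awk pass that compares whole fields and needs no sorting, much like the command tried in the question. The sketch below also strips a trailing carriage return from every line; whether Windows line endings are what broke the original awk attempt is only an assumption, and match_ips is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# match_ips LOOKUP LIST: print every line of LIST (format "id,ip") whose
# IP field appears as a whole line in LOOKUP. Hypothetical helper name.
match_ips() {
    awk -F, '
        { sub(/\r$/, "") }            # drop a trailing CR, if any
        NR == FNR { seen[$0]; next }  # first file: remember each IP
        $2 in seen                    # second file: print lines whose IP was seen
    ' "$1" "$2"
}

# Usage: match_ips 2.txt 1.txt
```

This keeps the 1 million IPs of 2.txt in memory as array keys and reads each file once.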

I also ran a quick test on my small MacBook Air using 1 million IPs in 1.txt and 0.5 million IPs in 2.txt. It takes 19 seconds when the files have to be sorted.

Upvotes: 1
