Evan

Reputation: 259

Python: compare two large lists and process one

I have two text files. a.txt contains about one million lines like this:

991000000019999998,b10000021,
991000000019703408,b10000021,
991000545455435408,b10000045,
991000000029703408,b10000045,
...

The first part is the barcode (991000000019703408) and the second part is the bib_number (b10000021). Note that a bib_number can appear on more than one line, but each barcode is unique, so I don't think I can use set() here. The other file, b.txt, contains only bib_numbers (about 600 thousand records):

b10000021
b10000045
b10000215
...

Now I have to compare the two files: for each line in a.txt, if its bib_number (like b10000045) is not in b.txt, the whole line needs to be written to c.txt, for example (991000000029703408,b10000045,).

I wrote the code below, but after 20 minutes it still had not produced a result.

import re

with open("a.txt", "r") as f1, open("b.txt", "r") as f2, open("c.txt", "w") as f3:
    total_bb = f1.readlines()
    list_match = f2.readlines()
    for item_bb in total_bb:
        recordList = re.split(",", item_bb)
        item_bb_w = recordList[1] + '\n'
        if item_bb_w not in list_match:
            f3.write(item_bb)

Are there any tricks to speed up a comparison of two large lists like this? Thanks

Upvotes: 1

Views: 387

Answers (1)

AChampion

Reputation: 30288

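Build a set from the bib_numbers in b.txt. Membership lookup in a set is O(1) on average, whereas `not in` on a list scans the whole list for every line of a.txt: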

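with open("a.txt", "r") as f1, open("b.txt", "r") as f2, open("c.txt", "w") as f3:
    # Load every bib_number from b.txt into a set, dropping newlines
    bs = set(b.strip() for b in f2)
    # Stream a.txt line by line instead of reading it all into memory
    for a in f1:
        x = a.split(',')
        # x[1] is the bib_number; write the original line if it is not in b.txt
        if x[1].strip() not in bs:
            f3.write(a)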
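To see why the set matters, here is a rough timing sketch. The bib_numbers are made up for illustration (100,000 generated values, not your real data), and the exact timings will vary by machine, so treat it as a demonstration rather than a benchmark.

```python
import timeit

# Hypothetical data: 100,000 generated bib_numbers (not taken from b.txt)
bib_numbers = [f"b{n:08d}" for n in range(100_000)]
as_list = bib_numbers
as_set = set(bib_numbers)

# A value that is absent forces the list to be scanned to the end
missing = "b99999999"

# Time 100 membership tests against each container
list_time = timeit.timeit(lambda: missing in as_list, number=100)
set_time = timeit.timeit(lambda: missing in as_set, number=100)

print(f"list: {list_time:.4f}s for 100 lookups")
print(f"set:  {set_time:.6f}s for 100 lookups")
```

With a million lines in a.txt and 600 thousand entries in the list, the list version repeats that full scan for every line, which is where the 20 minutes go.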

I would also look at the csv module for reading comma separated values.
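A minimal sketch of the same idea with the csv module, assuming the three-field layout shown in the question (barcode, bib_number, trailing empty field). The function name `write_unmatched` and its parameters are my own, chosen for illustration.

```python
import csv

def write_unmatched(a_path, b_path, c_path):
    """Copy rows of a_path whose bib_number is not listed in b_path to c_path.

    Returns the number of rows written. Hypothetical helper, not a library API.
    """
    # One bib_number per line in b_path; skip blank lines
    with open(b_path, newline="") as fb:
        known = {line.strip() for line in fb if line.strip()}

    written = 0
    with open(a_path, newline="") as fa, open(c_path, "w", newline="") as fc:
        reader = csv.reader(fa)
        writer = csv.writer(fc, lineterminator="\n")
        for row in reader:
            # Skip blank or malformed lines that have no bib_number field
            if len(row) < 2:
                continue
            if row[1].strip() not in known:
                writer.writerow(row)
                written += 1
    return written
```

Calling `write_unmatched("a.txt", "b.txt", "c.txt")` would then do the whole job. The trailing comma on each line survives because csv.reader returns it as an empty third field and csv.writer writes that field back out.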

Upvotes: 1
