Reputation: 1423
I have several text files, each containing a SINGLE COLUMN, inside a directory. I have to combine them all into one text file, removing duplicate lines. I am doing it with the following code. However, my text files are extremely large, so what is the best and fastest way of doing this?
import os, glob
files = glob.glob('*.txt')
with open('combinedfile.txt', 'w') as fo:
    all_lines = []
    for f in files:
        with open(f, 'r') as fi:
            all_lines.append(fi.read())
    all_lines = set(all_lines)
    for item in all_lines:
        fo.write(item + '\n')
Upvotes: 2
Views: 3419
Reputation: 3595
You were saving the complete file content, not the individual lines, so you would never find duplicates. I converted this to readlines(). Note that readlines() keeps each line's trailing newline, so the lines are stripped before deduplicating; otherwise the final join would double the newlines, and the last line of a file (which may lack a newline) would never match its duplicates. When writing, you can join the text first and do a single write, which should give you some extra performance.
import glob
files = glob.glob('*.txt')
all_lines = []
for f in files:
    with open(f, 'r') as fi:
        # strip the trailing newline so duplicates compare equal
        all_lines += [line.rstrip('\n') for line in fi.readlines()]
all_lines = set(all_lines)
with open('combinedfile.txt', 'w') as fo:
    fo.write("\n".join(all_lines))
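Since the files are extremely large, here is a minimal sketch of an alternative that avoids accumulating every line in a list before deduplicating: stream each file line by line, track lines already seen in a set, and write each new line to the output immediately. This assumes the set of unique lines still fits in memory; if even that is too large, an external tool such as sort -u may be a better fit.

import glob

seen = set()
with open('combinedfile.txt', 'w') as fo:
    for f in glob.glob('*.txt'):
        with open(f, 'r') as fi:
            for line in fi:
                # normalize so a missing final newline can't hide a duplicate
                line = line.rstrip('\n')
                if line not in seen:
                    seen.add(line)
                    fo.write(line + '\n')

As a side effect, this also preserves the first-seen order of the lines, which the set-then-join version does not.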
Upvotes: 1