Borys

Reputation: 1423

fastest way to combine several text files without duplicate lines

I have several text files, each with a single column, inside a directory. I have to combine all of them into one text file, removing the duplicate lines. I am doing it with the following code. However, my text files are extremely large. So what is the best and fastest way of doing it?

import os, glob
files = glob.glob('*.txt')

with open('combinedfile.txt','w') as fo:
    all_lines = []
    for f in files:
        with open(f,'r') as fi:
            all_lines.append(fi.read())
    all_lines = set(all_lines)

    for item in all_lines:
        fo.write(item + '\n')

Upvotes: 2

Views: 3419

Answers (1)

k-nut

Reputation: 3595

You were saving the complete file contents, not the individual lines, so you would never find duplicates. I converted this to readlines and strip the trailing newlines so identical lines compare equal. When writing, you can join the text first and do a single write, which should give you some extra performance.

import glob
files = glob.glob('*.txt')

all_lines = []
for f in files:
    with open(f, 'r') as fi:
        all_lines += fi.readlines()
# strip trailing newlines so identical lines always compare equal
all_lines = set(line.rstrip('\n') for line in all_lines)

with open('combinedfile.txt', 'w') as fo:
    fo.write("\n".join(all_lines))
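If memory is also a concern with extremely large files, here is a minimal sketch of a streaming variant that writes each line the first time it is seen, so the full joined string never has to be built (the seen set still holds one copy of every unique line):

import glob

seen = set()  # one copy of every unique line stays in memory
with open('combinedfile.txt', 'w') as fo:
    for f in glob.glob('*.txt'):
        with open(f, 'r') as fi:
            for line in fi:
                line = line.rstrip('\n')  # ignore trailing newlines when comparing
                if line not in seen:
                    seen.add(line)
                    fo.write(line + '\n')

This variant also preserves the order in which lines first appear, which the set-then-join approach does not.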

Upvotes: 1
