Reputation: 1423
I have several text files, each containing a SINGLE COLUMN, inside a directory. I have to combine them all into one text file, removing duplicate lines. I am doing it with the following code. However, my text files are extremely large, so what is the best and fastest way of doing this?
import os, glob
files = glob.glob('*.txt')
with open('combinedfile.txt', 'w') as fo:
    all_lines = []
    for f in files:
        with open(f, 'r') as fi:
            all_lines.append(fi.read())
    all_lines = set(all_lines)
    for item in all_lines:
        fo.write(item + '\n')
Upvotes: 2
Views: 3419
Reputation: 3595
You were saving the complete file content, not the individual lines, so you would never find duplicates. I converted this to readlines(). Note that readlines() keeps each line's trailing newline, so the lines are stripped before deduplicating; otherwise the final join would double the newlines, and the last line of a file (which may lack a newline) would never match its duplicates. When writing, you can join the text first and do a single write, which should give you some extra performance.
import glob
files = glob.glob('*.txt')
all_lines = []
for f in files:
    with open(f, 'r') as fi:
        # strip the trailing newline so duplicates compare equal
        all_lines += [line.rstrip('\n') for line in fi.readlines()]
all_lines = set(all_lines)
with open('combinedfile.txt', 'w') as fo:
    fo.write("\n".join(all_lines))
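Since the files are extremely large, here is a minimal sketch of an alternative that avoids accumulating every line in a list before deduplicating: stream each file line by line, track lines already seen in a set, and write each new line to the output immediately. This assumes the set of unique lines still fits in memory; if even that is too large, an external tool such as sort -u may be a better fit.

import glob

seen = set()
with open('combinedfile.txt', 'w') as fo:
    for f in glob.glob('*.txt'):
        with open(f, 'r') as fi:
            for line in fi:
                # normalize so a missing final newline can't hide a duplicate
                line = line.rstrip('\n')
                if line not in seen:
                    seen.add(line)
                    fo.write(line + '\n')

As a side effect, this also preserves the first-seen order of the lines, which the set-then-join version does not.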
Upvotes: 1