Use Python to find and remove duplicate text in a collection of files

Question

I have a collection of 40-50 text files that contain markdown. Some of them contain duplicate words, sentences, and paragraphs. I'm looking for a script/algorithm to scan the files and help me identify matches (or near matches). Where can I find such a thing? Searching for this type of thing online yielded results for other types of problems, but not this one. Would appreciate any clues to help me narrow my search...

draco1111 · Accepted Answer

Basically, a simple brute-force comparison can solve this. Depending on your requirements (running time, memory, ...), you should also consider other string-search algorithms: Boyer–Moore, Rabin–Karp, or Knuth–Morris–Pratt.
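
For a collection of 40-50 files, the brute-force route could look roughly like the sketch below. It is only an illustration under some assumptions: paragraphs are separated by blank lines, the files are UTF-8 with a `.md` extension, and a similarity threshold of 0.85 counts as a "near match". The function names are made up for this example. Note that it does not implement any of the three algorithms named above; it uses a dictionary lookup for exact repeats and the standard library's `difflib.SequenceMatcher` for near matches.

```python
import difflib
import itertools
import re
from collections import defaultdict
from pathlib import Path


def split_paragraphs(text):
    """Split text on blank lines and return the non-empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def normalize(paragraph):
    """Lowercase and collapse whitespace so trivial differences do not hide a match."""
    return " ".join(paragraph.lower().split())


def find_exact_duplicates(docs):
    """Return {normalized paragraph: [(file name, paragraph index), ...]} for repeats only."""
    seen = defaultdict(list)
    for name, text in docs.items():
        for i, para in enumerate(split_paragraphs(text)):
            seen[normalize(para)].append((name, i))
    return {p: locs for p, locs in seen.items() if len(locs) > 1}


def find_near_duplicates(docs, threshold=0.85):
    """Brute force: compare every pair of paragraphs and keep the similar ones.

    The threshold is an assumed starting value; tune it for your text.
    """
    paras = [
        (name, i, normalize(p))
        for name, text in docs.items()
        for i, p in enumerate(split_paragraphs(text))
    ]
    matches = []
    for (n1, i1, p1), (n2, i2, p2) in itertools.combinations(paras, 2):
        if p1 == p2:
            continue  # exact repeats are reported by find_exact_duplicates
        ratio = difflib.SequenceMatcher(None, p1, p2).ratio()
        if ratio >= threshold:
            matches.append(((n1, i1), (n2, i2), round(ratio, 2)))
    return matches


def load_markdown(folder):
    """Read every *.md file in a folder into a {path: text} dict (assumes UTF-8)."""
    return {
        str(p): p.read_text(encoding="utf-8")
        for p in sorted(Path(folder).glob("*.md"))
    }
```

To try it, something like `docs = load_markdown("notes")` followed by `find_exact_duplicates(docs)` and `find_near_duplicates(docs)` would list the candidate matches, where `"notes"` is a placeholder folder name. The pairwise comparison is quadratic in the number of paragraphs, which should be fine at this size; for much larger collections, a hashing approach in the spirit of Rabin–Karp would likely scale better. The sketch only reports matches and leaves the decision of what to remove to you.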
