rolfedh

Reputation: 281

Use Python to find and remove duplicate text in a collection of files

I have a collection of 40-50 text files that contain markdown. Some of them contain duplicate words, sentences, and paragraphs. I'm looking for a script/algorithm to scan the files and help me identify matches (or near matches). Where can I find such a thing? Searching for this type of thing online yielded results for other types of problems, but not this one. Would appreciate any clues to help me narrow my search...

Upvotes: 0

Views: 295

Answers (1)

draco1111

Reputation: 300

Basically, a simple brute-force approach can solve this problem. Depending on your requirements (running time, memory, ...), you could also consider other string search algorithms: Boyer–Moore, Rabin–Karp, or Knuth–Morris–Pratt.
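As a rough illustration of the brute-force idea, here is a minimal sketch that compares every paragraph against every other paragraph. It assumes paragraphs are separated by blank lines and that the files end in `.md`. The function names (`split_paragraphs`, `find_duplicates`, `load_markdown`) and the `0.9` threshold are made up for this example. Near matches are scored with the standard library's `difflib.SequenceMatcher`, which is a swapped-in technique and not one of the algorithms named above.

```python
import difflib
import itertools
import re
from pathlib import Path


def split_paragraphs(text):
    """Split text on blank lines and collapse whitespace inside each paragraph."""
    chunks = re.split(r"\n\s*\n", text)
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]


def find_duplicates(docs, threshold=0.9):
    """Brute-force pairwise comparison of all paragraphs.

    docs maps a file name to its text. Returns a list of
    (name_a, name_b, ratio, paragraph_a, paragraph_b) tuples for every
    pair whose similarity ratio is at least `threshold` (an assumed cutoff).
    """
    paragraphs = [
        (name, para)
        for name, text in docs.items()
        for para in split_paragraphs(text)
    ]
    matches = []
    # O(n^2) comparisons: fine for 40-50 files, slow for large collections.
    for (name_a, a), (name_b, b) in itertools.combinations(paragraphs, 2):
        ratio = 1.0 if a == b else difflib.SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            matches.append((name_a, name_b, round(ratio, 2), a, b))
    return matches


def load_markdown(folder):
    """Read every *.md file in `folder` (assumed layout) into a dict."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.md"))
    }
```

To try it, something like `for match in find_duplicates(load_markdown("docs")): print(match)` would list candidate pairs, where `"docs"` is a placeholder folder name. The same loop works at sentence or word level if you change how the text is split.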

Upvotes: 1
