Loic Duros

Reputation: 5782

matching long strings (mostly similar) with long strings

I'm trying to find the best way to match and recognize different license texts within files. These are fairly long multiline strings (sometimes two letter-size pages), and when they match they should be mostly the same except for a few variable elements (name, date, odd newlines, odd spaces). My question is: what's the best way to match long strings inside even longer strings? Is using regular expressions justified for this (a huge regexp containing the whole license text, with a few wildcards for the variable elements)? Or is there a string searching/matching algorithm that is particularly well suited to it?

Upvotes: 0

Views: 825

Answers (2)

mcdowella

Reputation: 19601

Most regular expression libraries are tuned to be fast in practice on the kinds of regular expressions people usually write, sometimes at the expense of rare cases where a carefully constructed regular expression takes a horrendous amount of time. If your pattern is not one of those pathological cases, its length probably doesn't matter much: in practice, most of the places where it fails to match the text can be recognised by checking only a few characters of the text and pattern, and these mismatches are where the time goes.
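As a rough illustration of the "whole license text with a few wildcards" idea from the question, here is a minimal Python sketch. The `<VAR>` placeholder, the 200-character cap on each variable field, and the helper name `template_to_regex` are all assumptions made for this example, not part of any library.

```python
import re

def template_to_regex(template: str, placeholder: str = "<VAR>") -> re.Pattern:
    """Build a whitespace-tolerant regex from a license template.

    `placeholder` marks the variable parts (name, date, ...); it is a
    made-up convention for this sketch.
    """
    parts = []
    for chunk in template.split(placeholder):
        # Escape each literal word and allow any run of whitespace
        # (spaces, odd newlines) between consecutive words.
        words = chunk.split()
        parts.append(r"\s+".join(re.escape(w) for w in words))
    # A bounded, lazy wildcard stands in for each variable element.
    # The bound keeps a failed match from scanning the whole file.
    return re.compile(r".{0,200}?".join(parts), re.DOTALL)


pattern = template_to_regex(
    "Copyright (c) <VAR>\nPermission is hereby granted, free of charge"
)
text = "header\nCopyright (c)  2012 Jane Doe\nPermission is\n hereby granted,   free of charge, to any"
print(bool(pattern.search(text)))
```

Bounding the wildcard instead of using an open-ended `.*` is one way to stay clear of the slow special cases mentioned above; the right bound depends on how long the variable fields can get.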

If you want to be sure of good performance, I would look for a single string of fixed text (as large as possible) that must exist in all forms of the license, search for this in the files, and then recheck the few occurrences of it in some more complex way to see whether they are true matches or not. But it is very likely that this is pretty much what will happen, in practice, if you do an ordinary regex search. Why not let your regex loose on the files, or on a subset of them, and see how long it takes?
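The two-stage approach described above could be sketched in Python roughly as follows. The function names, the `window` size, and the sample anchor phrase are assumptions for illustration; the anchor also has to be short enough that odd newlines or spaces cannot fall inside it, or the text must be whitespace-normalized first.

```python
import re

def find_candidates(text, anchor):
    """Yield the offset of every occurrence of the fixed anchor string."""
    start = text.find(anchor)
    while start != -1:
        yield start
        start = text.find(anchor, start + 1)

def contains_license(text, anchor, full_pattern, window=5000):
    """Cheap substring search first, full regex only near each hit.

    `window` (hypothetical default) is how many characters on each side
    of the anchor are handed to the more expensive check.
    """
    for pos in find_candidates(text, anchor):
        lo = max(0, pos - window)
        hi = pos + len(anchor) + window
        # Compiled patterns accept pos/endpos to limit the searched region.
        if full_pattern.search(text, lo, hi):
            return True
    return False


full = re.compile(
    r"Copyright\s+\(c\).{0,200}?Permission\s+is\s+hereby\s+granted", re.DOTALL
)
sample = "x" * 10000 + "Copyright (c) 2012 Jane\nPermission is hereby granted, free"
print(contains_license(sample, "hereby granted", full))
```

Timing this against a plain `full.search(text)` on a sample of real files is the simplest way to find out whether the prefilter is worth the extra code.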

Upvotes: 0

Ivan Bianko

Reputation: 1769

Read about the longest common subsequence (LCS) of two strings. The standard algorithm for it is based on dynamic programming.
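A minimal Python sketch of the textbook dynamic-programming LCS, applied to words rather than characters so that odd newlines and spaces drop out and the table stays small. The `license_similarity` helper and its scoring (share of template words found in order in the candidate) are assumptions for illustration, not something the answer specifies.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences.

    Classic O(len(a) * len(b)) dynamic programme, keeping only the
    previous row so memory stays O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def license_similarity(template, candidate):
    """Fraction of the template's words that appear, in order, in the candidate."""
    t, c = template.split(), candidate.split()
    if not t:
        return 0.0
    return lcs_length(t, c) / len(t)


score = license_similarity(
    "Copyright (c) YEAR NAME Permission is hereby granted",
    "Copyright (c) 2012 Jane Permission  is\nhereby granted",
)
print(score)  # 0.75: six of the eight template words match in order
```

A score near 1.0 suggests the candidate is the same license with only the variable fields changed; the cut-off to use is a judgment call that would need tuning on real files.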

Upvotes: 1
