Daniel Rikowski
Daniel Rikowski

Reputation: 72514

Approaches for reversing sprintf/format

I have to heuristically determine the format pattern strings by analyzing the formatted results.

For example I have these strings:

You have 3 unread messages.

You have 10 unread messages.

I'm sorry, Dave. I'm afraid I can't do that.

I'm sorry, Frank. I'm afraid I can't do that.

This statement is false.

I want to derive these format strings:

You have %s unread messages

I'm sorry, %s. I'm afraid I can't do that.

This statement is false.

Which approaches and/or algorithms could help me here?

My first thought was using machine learning stuff, but my guts tell me this could be a rather classic problem.

Some additional requirements:

Upvotes: 6

Views: 733

Answers (1)

Fred Foo
Fred Foo

Reputation: 363597

  1. Cluster the strings by some metric of similarity (I'd try length of longest common subsequence, LCS). Determining the number of clusters is the hard part, if you don't know it beforehand.

  2. Within each cluster, determine the LCS of all strings in it, recording the position of the gaps that occur. Replace the gaps with %s. (You may want to build a function that returns an LCS-based format string and fold/reduce that over the cluster.)

The above is a greedy algorithm that, given {foobar, fooBaR} produces foo%sa%s. You may want to replace any pair of occurrences of %s separated by a single character (or a single non-whitespace char, etc) by a single %s, recursively.

Upvotes: 1

Related Questions