Fred Milton

Reputation: 129

Diff Algorithm for Legislation

As part of an ambitious project, I am trying to better understand the legislative text written into bills introduced in the U.S. Congress. I have electronic versions of recent bills and want to implement an algorithm that compares a bill with prior bills, looking for similarities. The hypothesis is that many bills that fail end up getting co-opted into other bills.

Obviously, this is a large task. There are already many questions about diff engines, but my problem is slightly different: bills often package several ideas together, so the diff engine would need to compare portions of bills, not entire bills.

Any recommendations on difference algorithms or a method to go about doing this? I have access to serious computational power, but do keep in mind that I will be using a dataset of about 100,000 bills.

Upvotes: 2

Views: 137

Answers (2)

Kevin

Reputation: 56129

Very interesting idea. I would start by looking into longest common subsequence algorithms and see about adapting them to (1) report any shared sequence over some threshold, say 20 words, and (2) handle a bit of fuzziness, in case a word or two gets changed. I'd suggest looking at the source code of diff to start.
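To make the idea concrete, here is a minimal sketch in Python. It is not the diff source itself: it uses the standard library's `difflib.SequenceMatcher` on word tokens as a stand-in for a longest-common-subsequence style matcher. The function name `shared_passages` and the parameters `min_words` and `max_gap` are made up for illustration. Exact matching runs that are separated by only a word or two are merged, which gives a crude form of the fuzziness mentioned above.

```python
from difflib import SequenceMatcher


def shared_passages(text_a, text_b, min_words=20, max_gap=2):
    """Return (start_a, start_b, passage) for word runs shared by both texts.

    Neighbouring exact matches separated by at most `max_gap` words in each
    text are merged, so a passage survives a changed word or two.
    `start_a` and `start_b` are word offsets; `passage` is taken from text_a.
    """
    a = text_a.lower().split()
    b = text_b.lower().split()
    # autojunk=False stops difflib from discarding very frequent words.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # The final block returned by difflib is a zero-length sentinel; skip it.
    blocks = [m for m in matcher.get_matching_blocks() if m.size > 0]

    merged = []  # each entry: [start_a, end_a, start_b, end_b]
    for m in blocks:
        if merged:
            last = merged[-1]
            gap_a = m.a - last[1]
            gap_b = m.b - last[3]
            if gap_a <= max_gap and gap_b <= max_gap:
                # Close enough to the previous run: extend it.
                last[1] = m.a + m.size
                last[3] = m.b + m.size
                continue
        merged.append([m.a, m.a + m.size, m.b, m.b + m.size])

    # Keep only runs that meet the length threshold (measured in text_a).
    return [
        (sa, sb, " ".join(a[sa:ea]))
        for sa, ea, sb, eb in merged
        if ea - sa >= min_words
    ]
```

Two caveats with this sketch: the matching blocks come back in order, so a section that was moved to a different position in the later bill can be missed, and comparing every pair among 100,000 bills this way would be slow, so you would probably want a cheaper filter to pick candidate pairs first.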

Upvotes: 1

Mitch Wheat

Reputation: 300719

Take a look at Simian - Similarity Analyser. It works for plain text as well as code.

Upvotes: 1
