Reputation: 129
As part of an ambitious project, I am trying to better understand the legislative text written into bills introduced in the U.S. Congress. I have electronic versions of recent bills and am attempting to implement an algorithm that compares a bill against prior bills, looking for similarities. The hypothesis is that many bills that fail end up getting co-opted into other bills.
Obviously, this is a large task. Many questions exist regarding difference engines, but my issue is slightly different: bills are often introduced that package several ideas together, so the difference engine would need to compare portions of bills, not entire bills.
Any recommendations on difference algorithms or a method to go about doing this? I have access to serious computational power, but do keep in mind that I will be using a dataset of about 100,000 bills.
Upvotes: 2
Views: 137
Reputation: 56129
Very interesting idea. I would start by looking into longest common subsequence algorithms and see about adapting them to (1) report any common sequence over some threshold, say 20 words, and (2) tolerate a bit of fuzziness, in case a word or two gets changed. I'd suggest looking at the diff source code as a starting point.
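As a rough illustration of point (1), here is a minimal sketch in Python using the standard library's `difflib.SequenceMatcher` to report exact word runs shared by two bills. The function name `shared_passages`, the whitespace tokenization, and the 20-word threshold are all illustrative assumptions; the fuzziness in point (2) would need further work, e.g. merging matching blocks separated by only a word or two.

```python
# Sketch: report contiguous word runs shared by two bill texts.
from difflib import SequenceMatcher


def shared_passages(bill_a: str, bill_b: str, min_words: int = 20):
    """Return word runs of at least `min_words` appearing in both bills."""
    words_a = bill_a.split()   # naive whitespace tokenization
    words_b = bill_b.split()
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # tokens, which would otherwise drop common legislative boilerplate words.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    passages = []
    for match in matcher.get_matching_blocks():
        if match.size >= min_words:
            passages.append(" ".join(words_a[match.a:match.a + match.size]))
    return passages
```

Running this pairwise over 100,000 bills is quadratic in the number of bills, so in practice you would want a cheap pre-filter (e.g. shared rare n-grams) to decide which pairs are worth a full comparison.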
Upvotes: 1
Reputation: 300719
Take a look at Simian - Similarity Analyser. It works for plain text as well as code.
Upvotes: 1