dr. evil

Reputation: 27275

How to understand if the static part of the text has been changed? (diff algorithm related)

First of all, this is a tough thing to solve. So far I haven't come up with a good example, but I hope someone here can figure this out. I hope there is a known way to solve this kind of problem, or an obscure algorithm.

Scenario:

Challenge

What I've tried

The limitations and weaknesses of this algorithm are pretty obvious. I've got good results in some cases, but it doesn't work as expected all the time.

My current class works like this:

Dim Analyser As New ContentAnalyzer()
Analyser.AddTrueCase(True1Html)
Analyser.AddTrueCase(True2Html)
Analyser.AddTrueCase(True3Html)

' Returns True if UnknownHtml is similar to the TRUE cases, otherwise False
Analyser.IsThisTrue(UnknownHtml)

Sorry if the title doesn't make much sense; I couldn't find a good way to describe it.

Upvotes: 2

Views: 158

Answers (3)

Svante

Reputation: 51511

Perhaps you mean something like Bayesian filtering? You could look at what Paul Graham has done with spam: http://www.paulgraham.com/better.html
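To make the idea concrete, here is a minimal sketch of a two-class naive Bayes classifier over page tokens, written in Python for brevity. It is a plain multinomial naive Bayes with add-one smoothing, not the exact formula from the linked essay, and it assumes you can collect FALSE example pages as well as TRUE ones. The names `tokenize`, `NaiveBayesPageClassifier`, `add_case`, and `is_true` are hypothetical and only loosely mirror the question's `ContentAnalyzer`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesPageClassifier:
    """Hypothetical two-class classifier; labels are 'true' and 'false'.

    Assumes at least one non-empty example page was added for each label.
    """

    def __init__(self):
        self.token_counts = {"true": Counter(), "false": Counter()}
        self.doc_counts = {"true": 0, "false": 0}

    def add_case(self, label, html):
        """Record one example page under the given label."""
        self.token_counts[label].update(tokenize(html))
        self.doc_counts[label] += 1

    def log_score(self, label, html):
        """Log prior plus summed log likelihood of the page's tokens."""
        counts = self.token_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.token_counts["true"]) | set(self.token_counts["false"]))
        # Log prior from the share of training pages with this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokenize(html):
            # Add-one smoothing so unseen tokens do not zero out the score.
            score += math.log((counts[token] + 1) / (total + vocab))
        return score

    def is_true(self, html):
        """True if the page scores higher under 'true' than under 'false'."""
        return self.log_score("true", html) > self.log_score("false", html)


clf = NaiveBayesPageClassifier()
clf.add_case("true", "<p>Login successful. Welcome back, alice</p>")
clf.add_case("true", "<p>Login successful. Welcome back, bob</p>")
clf.add_case("false", "<p>Login failed. Invalid password</p>")
clf.add_case("false", "<p>Login failed. Account locked</p>")
print(clf.is_true("<p>Login successful. Welcome back, carol</p>"))
```

Tokens that appear in every page (markup, navigation) contribute about equally to both scores, so the decision is driven mostly by the tokens that differ between the two sets of examples.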

Upvotes: 0

RossFabricant

Reputation: 12492

It sounds like you're doing fairly simple document classification. This is a heavily researched field, especially lately due to spam filters. Look into a library for document classification in your language of choice.

Classifier4j looks like a popular library that runs on the Java VM and has been ported to .NET.

Upvotes: 2

JB King

Reputation: 11910

Either this is really misstated or I'm just not getting something:

The application requests the web page, gets it, and has to ascertain whether it is another "True" or "False", right? That is, the response itself doesn't state true or false up front, which is where my first confusion is.

Secondly, why aren't you doing a similar comparison on the false cases and seeing if there are sufficient similarities to create 3 buckets of results for some random page requested:

1) Page is more similar to true and thus is viewed as true.

2) Page is more similar to false and thus is viewed as false.

3) Page isn't more similar to either and thus the result is something like a null or exception situation as it isn't possible to discern which result makes sense.

Example of where that 3rd case could happen: suppose the page contains an integer, and the result is true if it is positive and false if it is negative. What if the integer is 0? Does 0 count as positive since it is equal to its absolute value, or does it count as negative for some reason?

Or am I way off in what you are trying to do here?
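A rough sketch of the three buckets above, in Python, assuming you keep example pages for both outcomes. It uses the standard library's `difflib.SequenceMatcher` as the similarity measure, which is just one possible choice. The function names and the `margin` threshold are made up for illustration and are not taken from the question.

```python
from difflib import SequenceMatcher


def best_similarity(page, examples):
    """Highest SequenceMatcher ratio between the page and any example."""
    return max(SequenceMatcher(None, page, ex).ratio() for ex in examples)


def classify(page, true_cases, false_cases, margin=0.05):
    """Return True, False, or None when neither side wins by `margin`.

    `margin` is an illustrative threshold that would need tuning on real pages.
    """
    true_score = best_similarity(page, true_cases)
    false_score = best_similarity(page, false_cases)
    if true_score - false_score > margin:
        return True   # bucket 1: more similar to the true examples
    if false_score - true_score > margin:
        return False  # bucket 2: more similar to the false examples
    return None       # bucket 3: too close to call


true_cases = ["Login successful. Welcome back, alice",
              "Login successful. Welcome back, bob"]
false_cases = ["Login failed. Invalid password"]
print(classify("Login successful. Welcome back, carol", true_cases, false_cases))
```

Returning `None` instead of forcing a guess lets the caller log the undecided page and add it to one of the example sets later.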

Upvotes: 1
