Algorithms for calculating the similarity of numerous documents (e.g. books of the Bible)

Question

My goal is to process the Bible in a way that enables calculating the relative similarity of any two books of the Bible. Ideally, two books should score higher if their word distributions are similar, but also if they have more phrases in common. For example, the book of Matthew borrows heavily from the book of Mark, but is about twice the length, and while numerous passages are duplicated verbatim, the order of the duplicated verses is not consistent.

It would be great if this could be done hierarchically; verses processed individually, aggregated into chapters and then finally books. Given a verse, it would be good to be able to retrieve a ranked list of similar verses and so on with chapters and books.

If the system could give partial credit for similar words (walk, walked, walking) that would also be good.

Once completed I would like to extend this to any set of documents.

So far, I am considering storing the words in an inverted index in a graph database, and then using graph algorithms to score the similarity of the graphs, but I don't know what algorithm to use for the scoring (collaborative filtering?).

Something like Levenshtein distance or BK-trees may be helpful (for fuzzy matching) but seems inadequate as a total solution. Perhaps preprocessing the words through a BK-tree and using the results to add extra links to the graph could provide the fuzzy matching capability.

mcdowella · Accepted Answer

Word-frequency similarity measures include cosine similarity (http://en.wikipedia.org/wiki/Cosine_similarity) and the Jaccard index (http://en.wikipedia.org/wiki/Jaccard_index). Note the Jaccard article's reference to MinHash (http://en.wikipedia.org/wiki/MinHash), which you could use with phrases. The Kullback-Leibler divergence (http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) is not symmetric.
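As a rough illustration of the first two measures, here is a minimal Python sketch that works on plain word counts. The function names (tokenize, cosine_similarity, jaccard_index) and the sample strings are made up for this example, and the tokenizer is a crude assumption, not a recommendation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (crude)."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counters)."""
    # A Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_index(a, b):
    """Shared vocabulary size divided by combined vocabulary size."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Hypothetical sample documents, not real verses.
doc_a = Counter(tokenize("the cat sat"))
doc_b = Counter(tokenize("the cat ran"))
print(cosine_similarity(doc_a, doc_b))
print(jaccard_index(doc_a, doc_b))
```

Cosine similarity uses the counts, so it is insensitive to document length (useful when one book is twice as long as the other), while this Jaccard variant only looks at which words occur at all.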

As long as all you are interested in is word or phrase frequency, you only need counts, and with MinHash you only need selected counts. If you know something about the language in question, you might be able to match similar words by reducing each word to its root. For English, you might perhaps get language info from something like http://en.wikipedia.org/wiki/Wordnet#Other_languages. I don't know about Hebrew or New Testament Greek.
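To make the MinHash idea concrete for phrases, here is a small sketch that treats each document as a set of n-word phrases and keeps only the minimum hash per hash function. Everything here is an assumption for illustration: the helper names, the choice of 3-word phrases, 128 hash functions, and seeded BLAKE2b as the hash family.

```python
import hashlib


def shingles(words, n=3):
    """Return the set of n-word phrases (shingles) in a list of tokens."""
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seeded_hash(item, seed):
    """Deterministic 64-bit hash of a string, varied by an integer seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep only the smallest hash value."""
    if not items:
        raise ValueError("cannot build a signature from an empty set")
    return [min(seeded_hash(item, seed) for item in items)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard index."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical token lists standing in for two documents.
words_a = "one two three four five six".split()
words_b = "one two three four five seven".split()
sig_a = minhash_signature(shingles(words_a))
sig_b = minhash_signature(shingles(words_b))
print(estimate_jaccard(sig_a, sig_b))
```

The signatures have a fixed size regardless of document length, so you can store one per verse, chapter, and book and compare any pair cheaply. The estimate gets noisier as num_hashes shrinks.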

Where you have large chunks duplicated between two documents, you can use suffix arrays. One jumping-off point is http://algs4.cs.princeton.edu/63suffix/
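A toy version of the suffix-array approach, working on words instead of characters, might look like the sketch below. It builds the suffix array naively by sorting (fine for a demonstration, too slow for whole books), and the function name, the sentinel convention, and the sample token lists are assumptions of this example rather than anything taken from the linked page.

```python
SENTINEL = "\0"  # assumed never to occur as a real word


def longest_shared_run(words_a, words_b):
    """Longest contiguous word sequence that occurs in both token lists."""
    combined = words_a + [SENTINEL] + words_b
    boundary = len(words_a)  # index of the sentinel
    # Naive suffix array: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(len(combined)), key=lambda i: combined[i:])
    best = []
    # The longest shared run shows up as the longest common prefix of two
    # neighbouring suffixes that start in different documents.
    for i, j in zip(suffix_array, suffix_array[1:]):
        if i == boundary or j == boundary:
            continue  # skip the suffix that starts at the sentinel
        if (i < boundary) == (j < boundary):
            continue  # both suffixes come from the same document
        k = 0
        while (i + k < len(combined) and j + k < len(combined)
               and combined[i + k] == combined[j + k]
               and combined[i + k] != SENTINEL):
            k += 1
        if k > len(best):
            best = combined[i:i + k]
    return best


# Hypothetical token lists standing in for two documents.
doc_a = "one two three four five".split()
doc_b = "zero three four five six one two".split()
print(longest_shared_run(doc_a, doc_b))
```

Because the comparison is positional only within each run, it finds duplicated passages even when they appear in a different order in the two documents. Collecting every adjacent pair above some length threshold, instead of only the best one, would list all shared passages.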
