Reputation: 309
I'm thinking to write a simple research paper manager. The idea is to have a repository containing for each paper its metadata
paper_id -> [title, authors, journal, comments...]
Since it would be nice to have the possibility to import the paper dump of a friend, I'm thinking on how to generate the paper_id of a paper: IMHO should be produced by the text of the pdf, to garantee that two different collections have the same ids only for the same papers. At the moment, I extract the text of the first page using the iText library (removing the possible annotations), and i compute a simhash footprint from the text. the main problem is that sometime text is slightly different (yes, it happens! for example this and this) so i would like to be tolerant. With simhash i can compute how much the are similar the original document, so in case the footprint is not in the repo, i'll have to iterate over the collection looking for 'near' footprints.
I'm not convinced by this method, could you suggest some better way to produce a signature (short, numerical or alphanumerical) for those kind of documents?
UPDATE I had this idea: divide the first page in 8 (more or less) not-overlapping squares, covering all the page, then consider the text in each square and generate a simhash signature. At the end I'll have a 8x64=512bit signature and I can consider two papers the same if the sum of the differences between their simhash signatures sets is under a certain treshold.
Upvotes: 1
Views: 270
Reputation: 8206
In case you actually have a function that inputs two texts and returns a measure of their similarity, you do not have to iterate the entire Repository. Given an article that is not in the repository, you can iterate only articles that have approximately the same length. for example, given an article that have 1000 characters, you will compare it to articles having between 950 and 1050 characters. For this you will need to have a data structure that maps ranges to articles and you will have to fine tune the size of the range. Range too large- too many items in each range. Range too small- higher potential of a miss.
Of course this will fail on some edge cases. For example, if you have two documents that the second is simply the first that was copy pasted twice: you would probably want them to be considered equal, but you will not even compare them since they are too far apart in length. There are methods to deal with that also, but you probably 'Ain't gonna need it'.
Upvotes: 1