I have to assess pairwise similarities of documents of very different sizes (from 300 words to more than 200k words). To do so, I have created a procedure using the LSA algorithm as implemented in gensim. It includes these steps: document preprocessing, creating BoW vectors, applying TF-IDF weighting, finding topic distributions for documents using LSA, and computing pairwise similarities (sketched in the code below).
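For reference, here is a minimal sketch of what I mean by that pipeline, assuming gensim and a list of raw text strings; the parameter values (e.g. `num_topics=300`) and the trivial preprocessing are placeholders rather than my exact settings:

```python
from gensim import corpora, models, similarities
from gensim.utils import simple_preprocess

raw_documents = ["first document text ...", "second document text ..."]

# 1. Preprocessing: tokenize and lowercase (the real procedure may also
#    remove stop words, lemmatize, etc.)
texts = [simple_preprocess(doc) for doc in raw_documents]

# 2. BoW vectors
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

# 3. TF-IDF weighting
tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]

# 4. LSA (called LSI in gensim)
lsi = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)
lsi_corpus = lsi[tfidf_corpus]

# 5. Pairwise cosine similarities in the LSA topic space
index = similarities.MatrixSimilarity(lsi_corpus)
pairwise_sims = index[lsi_corpus]  # square matrix of document-document similarities
```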
The results I have obtained so far seem reasonable, as far as I have been able to verify the similarities manually. Nevertheless, I have doubts about the methodological correctness of applying LSA to a corpus of documents of very different sizes. I suspect that LSA might find topic distributions more accurately when the documents in a corpus are of comparable lengths (e.g., between 100 and 1500 words), whereas mixing documents of very different sizes in the same corpus might reduce the accuracy of topic assignment for some documents, leading to inadequate similarity assessment further down the pipeline.
I have looked up papers applying LSA to a similarly structured corpus or discussing this problem methodologically, but found no relevant insights. All papers I have found deal with corpora of similarly sized documents.
Could anybody please point me to relevant research dealing with this problem, reflect on it in light of the inner workings of LSA, or simply share their own experience with corpora of mixed-size documents? Any insight would be appreciated. If LSA indeed works best on corpora of similarly sized documents, how can one apply it to a mixed-size corpus? As I see it, one option would be to split large documents into smaller parts, run the procedure, and then average the computed similarity values (roughly as sketched below). Please let me know whether that would be a methodologically sound approach.
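To make the chunk-and-average idea concrete, this is a hypothetical sketch of what I have in mind, reusing the `dictionary`, `tfidf`, and `lsi` objects from the pipeline above; the chunk size of 1000 words and the helper names are illustrative assumptions, not a tested recommendation:

```python
import numpy as np
from gensim.matutils import cossim

def chunk_tokens(tokens, chunk_size=1000):
    """Split a token list into consecutive chunks of roughly equal size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def doc_to_lsi(tokens, dictionary, tfidf, lsi):
    """Project one token list into the LSA topic space."""
    return lsi[tfidf[dictionary.doc2bow(tokens)]]

def chunked_similarity(tokens_a, tokens_b, dictionary, tfidf, lsi, chunk_size=1000):
    """Average the cosine similarity over all chunk pairs of two documents."""
    chunks_a = [doc_to_lsi(c, dictionary, tfidf, lsi) for c in chunk_tokens(tokens_a, chunk_size)]
    chunks_b = [doc_to_lsi(c, dictionary, tfidf, lsi) for c in chunk_tokens(tokens_b, chunk_size)]
    sims = [cossim(va, vb) for va in chunks_a for vb in chunks_b]
    return float(np.mean(sims))
```

My concern is whether averaging over chunk pairs like this preserves the document-level similarity I am after, or whether it introduces its own bias.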