drdre

Reputation: 51

Query-document similarity with doc2vec

Given a query and a document, I would like to compute a similarity score using Gensim doc2vec. Each document consists of multiple fields (e.g., main title, author, publisher, etc.).

For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?

For inference, should I treat a query like a document? That is, should I use the model (trained on the documents) to infer a vector for the query?

Upvotes: 0

Views: 478

Answers (1)

gojomo

Reputation: 54163

The right answer will depend on your data & user behavior, so you'll want to try several variants.

Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).

You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.

Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in the author field, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could help the model learn the shifting senses of each token depending on the field it appears in.
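That field-prefixing preprocessing could look something like this (the record structure and helper name are hypothetical):

```python
# Prefix each token with its field name so the model can learn
# field-specific senses (e.g. 'title:john' vs 'author:john').
def field_tokens(record):
    tokens = []
    for field, text in record.items():
        for tok in text.lower().split():
            tokens.append(f"{field}:{tok}")
            tokens.append(tok)  # optionally keep the naked token too
    return tokens

record = {"title": "John Steinbeck Reader", "author": "John Steinbeck"}
tokens = field_tokens(record)
# tokens now mixes 'title:john', 'author:john', and plain 'john'
```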

Then, provided you have enough training data and choose the other model parameters well, your search interface might also preprocess queries similarly, when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)

In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling out with matches that are near in meaning even if not in literal tokens, a fuzzier vector document representation may be helpful.

Upvotes: 2
