Veroniky

Reputation: 11

Compare vectors of a doc and just a word

So, I have to compare the vector of an article with the vector of a single word, and I don't have any idea how to do it. It looks like BERT and Doc2Vec work well with long texts, while Word2Vec works with single words. But how do I compare a long text with just a word?

Upvotes: 1

Views: 642

Answers (3)

gojomo

Reputation: 54153

Some modes of the "Paragraph Vector" algorithm (aka Doc2Vec in libraries like Python gensim) will train both doc-vectors and word-vectors into a shared coordinate space. (Specifically, any of the PV-DM dm=1 modes, or the PV-DBOW mode dm=0 if you enable the non-default interleaved word-vector training using dbow_words=1.)

In such a case, you can compare Doc2Vec doc-vectors with the co-trained word-vectors, with some utility. You can see some examples in the follow-up paper from the originators of the "Paragraph Vector" algorithm, "Document Embedding with Paragraph Vectors".

However, beware that single words, having been trained in contexts of use, may not have vectors that match what we'd expect of those same words when intended as overarching categories. For example, education as used in many sentences wouldn't necessarily assume all the facets/breadth that you might expect from Education as a category-header.

Such single word-vectors might work better than nothing, and perhaps help serve as a bootstrapping tool. But it'd be better if you had expert-labelled examples of documents belonging to the categories of interest. Then you could also use more advanced classification algorithms, sensitive to categories that wouldn't necessarily be summarized-by (and in a tight sphere around) any single vector point. In real domains-of-interest, that'd likely do better than using single word-vectors as category-anchors.

For any other non-Doc2Vec method of vectorizing a text, you could conceivably get a comparable vector for a single word by supplying a single-word text to the method. (Even in a Doc2Vec mode that doesn't create word-vectors, like pure PV-DBOW, you could use that model's out-of-training-text inference capability to infer a doc-vector for a single-word doc, for known words.)

But again, such simplified/degenerate single-word outputs might not well match the more general/textured categories you're seeking. The models are more typically used on larger contexts, and narrowing their input to a single word might reflect the peculiarities of that unnatural input case more so than the usual import of the word in real contexts.

Upvotes: 3

Evan Mata

Reputation: 612

Based on your further comments that explain your problem a bit more, it sounds like you're actually trying to do topic modelling (categorizing documents by a given word is equivalent to labeling them with that topic). If this is what you're doing, I would recommend looking into LDA and variants of it (e.g. GuidedLDA).

Upvotes: 1

Separius

Reputation: 1296

You can use BERT as-is for words too: a single word is just a really short sentence. So, in theory, you should be able to use any sentence embedding you like.

But if you don't have any supervised data, BERT is not the best option for you and there are better options out there!

I think it's best to first try Doc2Vec and, if that doesn't work, switch to something else like SkipThought vectors or USE (Universal Sentence Encoder).

Sorry that I can't help you more; it's completely task- and data-dependent, so you should test different things.

Upvotes: 1
