How to find best match with sklearn pipeline in Python

Question

I've got a Pipeline setup using a TfidfVectorizer and TruncatedSVD. I train the models with sklearn and calculate the distance between two vectors using the cosine similarity. Here's my code:

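import logging

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from sklearn.pipeline import Pipeline


def create_scikit_corpus(leaf_names=None):

    # Tokenizer is my own tokenizer class, defined elsewhere.
    vectorizer = TfidfVectorizer(
        tokenizer=Tokenizer(),
        stop_words='english',
        use_idf=True,
        smooth_idf=True
    )

    # TruncatedSVD names its iteration parameter n_iter, not n_iterations.
    svd_model = TruncatedSVD(n_components=300,
                             algorithm='randomized',
                             n_iter=10,
                             random_state=42)
    svd_transformer = Pipeline([('tfidf', vectorizer),
                                ('svd', svd_model)])

    svd_matrix = svd_transformer.fit_transform(leaf_names)

    logging.info("Models created")

    test = "This is a test search query."
    # transform expects an iterable of documents, so wrap the query in a list.
    query_vector = svd_transformer.transform([test])
    distance_matrix = pairwise_distances(query_vector, svd_matrix, metric='cosine')

    return svd_transformer, svd_matrix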

The thing is that I'm not sure what to do once I have the distance_matrix variable. I guess I'm kinda confused on exactly what that is.

I'm trying to find which document matches best with my query. Thanks for a push in the right direction!

ldirer · Accepted Answer

Once you have the distance_matrix computed, you can find the closest singular vector according to the cosine similarity... And that might be the reason you are confused: what does this singular vector represent?

The problem is that the answer is not straightforward: the singular vector is usually not a document in your corpus.
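For the mechanics alone, here is a minimal sketch of how the closest row could be read out of distance_matrix. The numbers are made up for illustration and stand in for the output of pairwise_distances; with one query, the matrix has a single row and one column per row of svd_matrix.

```python
import numpy as np

# Hypothetical stand-in for
# pairwise_distances(query_vector, svd_matrix, metric='cosine'):
# one query row, one column per row of svd_matrix.
distance_matrix = np.array([[0.82, 0.15, 0.47, 0.91]])

# Cosine distance is 1 - cosine similarity, so the smallest value is the closest row.
best_index = int(np.argmin(distance_matrix[0]))
best_distance = float(distance_matrix[0, best_index])

# All row indices ordered from closest to farthest.
ranking = np.argsort(distance_matrix[0])

print(best_index, best_distance, ranking.tolist())
```

This only tells you which row is nearest; what that row represents is the question discussed above.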

If what you want is the best match as in "the document from your corpus that is the most similar to this one", there is something simpler to do: pick the closest document according to cosine similarity. You do not need SVD for this approach.
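As a rough sketch of that simpler approach, assuming a small made-up corpus in place of leaf_names and the default tokenizer instead of the custom Tokenizer from the question, it could look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for leaf_names.
documents = [
    "red apples and green pears",
    "fast cars on the open road",
    "search engines rank every query",
]

# Default tokenizer here; the custom Tokenizer from the question is left out.
vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(documents)

# transform expects an iterable of documents, so the query is wrapped in a list.
query = "a test search query"
query_vector = vectorizer.transform([query])

# Shape (1, n_documents): similarity of the query to each document.
similarities = cosine_similarity(query_vector, doc_matrix)

# Highest similarity means best match (the opposite direction to a distance).
best_index = int(similarities[0].argmax())
best_document = documents[best_index]

print(best_index, best_document)
```

Note that with similarities you take the maximum, whereas with the cosine distances returned by pairwise_distances you would take the minimum.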
