Richard

Reputation: 542

CLIP: Cosine Similarity of Text and Image Embeddings is low

I am using the HuggingFace CLIP model to generate text and image embeddings with get_text_features and get_image_features. When I calculate the cosine similarity between them, it is surprisingly low, especially compared to other embedding models, where I am used to values above 0.65. For example, for a photo of a cat, a photo of a dog, and the texts ["cat", "dog"], I get the following similarity matrix:

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint not shown in the original post; openai/clip-vit-base-patch32 is assumed here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

urls = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Orange_tabby_cat_sitting_on_fallen_leaves-Hisashi-01A.jpg/1920px-Orange_tabby_cat_sitting_on_fallen_leaves-Hisashi-01A.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/9/90/Labrador_Retriever_portrait.jpg/580px-Labrador_Retriever_portrait.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
texts = ["cat", "dog"]

# Tokenize the texts and preprocess the images in one call.
inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    truncation=True,
)

with torch.no_grad():
    # Project both modalities into the shared embedding space and L2-normalize.
    clip_text_embeddings = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    clip_text_embeddings = clip_text_embeddings / clip_text_embeddings.norm(
        dim=-1, keepdim=True
    )
    clip_image_embeddings = model.get_image_features(pixel_values=inputs["pixel_values"])
    clip_image_embeddings = clip_image_embeddings / clip_image_embeddings.norm(
        dim=-1, keepdim=True
    )
    # Rows are texts, columns are images.
    clip_cos_sim = torch.mm(clip_text_embeddings, clip_image_embeddings.T)
print(clip_cos_sim)

tensor([[0.2696, 0.1795],
        [0.2288, 0.2597]])
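
For reference, the usual zero-shot classification path goes through the model's forward pass, which multiplies these cosine similarities by the learned logit_scale before a softmax. A minimal sketch of that cross-check, assuming the same openai/clip-vit-base-patch32 checkpoint and the inputs prepared above:

with torch.no_grad():
    # The forward pass computes logit_scale * (image_embeds @ text_embeds.T) internally.
    outputs = model(**inputs)

# Shape (num_images, num_texts); softmax over the text dimension gives per-image label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)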

Are these expected values for CLIP or am I doing something wrong here?

Upvotes: 1

Views: 855

Answers (0)
