Reputation: 542
I am using the HuggingFace CLIP model to generate text and image embeddings with get_text_features and get_image_features. When I calculate the cosine similarity, it is surprisingly low - especially compared to other embedding models, where I am used to values above 0.65. For example, for a photo of a cat, a photo of a dog, and the texts ["cat", "dog"], I get the similarity matrix shown after the code below.
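The model and processor are set up along these lines (the exact checkpoint doesn't matter much for the question; openai/clip-vit-base-patch32 is just an example):

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# checkpoint name is only an example - any CLIP checkpoint is used the same way
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()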
urls = [
    "https://upload.wikimedia.org/wikipedia/commons/thumb/6/68/Orange_tabby_cat_sitting_on_fallen_leaves-Hisashi-01A.jpg/1920px-Orange_tabby_cat_sitting_on_fallen_leaves-Hisashi-01A.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/thumb/9/90/Labrador_Retriever_portrait.jpg/580px-Labrador_Retriever_portrait.jpg",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in urls]
texts = ["cat", "dog"]

# tokenize the texts and preprocess the images in one call
image_inputs = processor(
    text=texts,
    images=images,
    return_tensors="pt",
    padding=True,
    truncation=True,
)

with torch.no_grad():
    # text embeddings, L2-normalised so the dot product below is a cosine similarity
    clip_text_embeddings = model.get_text_features(
        input_ids=image_inputs["input_ids"],
        attention_mask=image_inputs["attention_mask"],
    )
    clip_text_embeddings = clip_text_embeddings / clip_text_embeddings.norm(
        dim=-1, keepdim=True
    )

    # image embeddings, normalised the same way
    clip_image_embeddings = model.get_image_features(image_inputs["pixel_values"])
    clip_image_embeddings = clip_image_embeddings / clip_image_embeddings.norm(
        dim=-1, keepdim=True
    )

    # (num_texts, num_images) cosine similarity matrix
    clip_cos_sim = torch.mm(clip_text_embeddings, clip_image_embeddings.T)

print(clip_cos_sim)
tensor([[0.2696, 0.1795],
[0.2288, 0.2597]])
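For reference, here is a quick sketch of how (as far as I understand it) the model's own forward pass uses these similarities - logits_per_text should just be the cosine similarity matrix scaled by the learned temperature logit_scale:

with torch.no_grad():
    outputs = model(**image_inputs)

# same (num_texts, num_images) layout as clip_cos_sim above
print(outputs.logits_per_text / model.logit_scale.exp())

# per-image softmax over the texts, i.e. zero-shot classification scores
print(outputs.logits_per_image.softmax(dim=-1))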
Are these expected values for CLIP, or am I doing something wrong here?
Upvotes: 1
Views: 855