hanugm

Reputation: 1417

CLIP model from `open_clip` module returns single embedding for 77 tokens

I'm using the open_clip module to obtain text embeddings from the CLIP model. When I tokenize a list containing a single text sequence and pass it to the model's encode_text method, I expect to get embeddings with a shape of [77, 1024]. However, I'm getting an output shape of [1, 1024].

Here's the relevant code:

import open_clip

model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-H-14-laion2B-s32B-b79K')

text_inputs = ["cat"]  
tokenized_inputs = tokenizer(text_inputs)
print(len(tokenized_inputs))  # This prints 77

text_embeddings = model.encode_text(tokenized_inputs)
print(text_embeddings.shape)  # This prints [1, 1024]

Am I missing something in how I'm using the tokenizer or the model's encode_text method? How can I obtain individual embeddings for each of the 77 tokens in the sequence? I am expecting a shape of [77, 1024].

Upvotes: 0

Views: 1401

Answers (2)

SwayStar123

Reputation: 1

You can use the Hugging Face CLIP models (open_clip just wraps around the Hugging Face libraries anyway), which have an output_hidden_states parameter that returns the outputs before the pooling layer.

See an example here https://github.com/huggingface/diffusers/blob/2432f80ca37f882af733244df24b46f2d447fbcf/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L323
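As a minimal sketch of that idea (not code from the linked pipeline, and assuming the question's checkpoint loads cleanly into the transformers CLIP classes):

import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Assumption: this repo also ships transformers-format weights and tokenizer files.
name = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_encoder = CLIPTextModel.from_pretrained(name)

inputs = tokenizer(["cat"], padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    out = text_encoder(**inputs, output_hidden_states=True)

print(out.last_hidden_state.shape)  # per-token (pre-pooling) embeddings, [1, 77, 1024]
print(len(out.hidden_states))       # one tensor per layer, plus the input embeddings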

Upvotes: 0

Giovanni Minelli

Reputation: 126

What you want is the output before the pooling layer, i.e. the embedding for each token of the sequence, which unfortunately is not returned by open_clip's encode_text call. The options I'm suggesting are:

  • import the library's source into your code and modify it so it returns whatever you need from the middle of the computation (see the sketch after this list)
  • convert the OpenCLIP checkpoint to something usable with the transformers library, so you can use a standard class. For instance, with a CLIPTextModel you can access the wanted output in output.last_hidden_state. You can follow this script to do the conversion: https://gist.github.com/calpt/8e3555bd11f1916b5169c8125117e5ee
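For the first option, a rough sketch of the modified forward pass: it mirrors open_clip's reference encode_text and stops just before the EOT pooling / projection step. Attribute names and the permute layout can differ between open_clip versions, so compare with the encode_text in your installed source.

import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    'hf-hub:laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-H-14-laion2B-s32B-b79K')

text = tokenizer(["cat"])                              # [1, 77]

with torch.no_grad():
    cast_dtype = model.transformer.get_cast_dtype()
    x = model.token_embedding(text).to(cast_dtype)     # [1, 77, width]
    x = x + model.positional_embedding.to(cast_dtype)
    x = x.permute(1, 0, 2)                             # NLD -> LND
    x = model.transformer(x, attn_mask=model.attn_mask)
    x = x.permute(1, 0, 2)                             # LND -> NLD
    x = model.ln_final(x)                              # per-token embeddings, pooling skipped

print(x.shape)  # [1, 77, 1024] for the ViT-H-14 text tower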

I also tried this directly:

CLIPTextModel.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

I haven't found anything that says this is doable (and I doubt it is). It loads the weights, but I guess that if something mismatches, it is simply ignored silently.
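If you try that route, one way to see whether anything was silently dropped is to ask from_pretrained for its loading report (a standard transformers option; sketched here, not verified against this exact checkpoint):

from transformers import CLIPTextModel

model, loading_info = CLIPTextModel.from_pretrained(
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
    output_loading_info=True,
)

# Empty lists mean the checkpoint tensors and the model architecture matched up cleanly.
print(loading_info["missing_keys"])
print(loading_info["unexpected_keys"])
print(loading_info["mismatched_keys"])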

Upvotes: 1
