Reputation: 1417
I'm using the open_clip module to obtain text embeddings from the CLIP model. When I tokenize a list containing a single text sequence and pass it to the model's encode_text method, I expect to get embeddings with a shape of [77, 1024]. However, I'm getting an output shape of [1, 1024].
Here's the relevant code:
import open_clip
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
tokenizer = open_clip.get_tokenizer('hf-hub:laion/CLIP-ViT-H-14-laion2B-s32B-b79K')
text_inputs = ["cat"]
tokenized_inputs = tokenizer(text_inputs)
print(len(tokenized_inputs)) # This prints 77
text_embeddings = model.encode_text(tokenized_inputs)
print(text_embeddings.shape) # This prints [1, 1024]
Am I missing something in how I'm using the tokenizer or the model's encode_text method? How can I obtain an individual embedding for each of the 77 tokens in the sequence? I am expecting a shape of [77, 1024].
Upvotes: 0
Views: 1401
Reputation: 1
You can use the Hugging Face CLIP models (open_clip just wraps around the Hugging Face libraries anyway), which have an output_hidden_states parameter that returns the outputs before the pooling layer.
See an example here: https://github.com/huggingface/diffusers/blob/2432f80ca37f882af733244df24b46f2d447fbcf/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L323
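A rough sketch of what that looks like with transformers (note: it assumes the laion/CLIP-ViT-H-14-laion2B-s32B-b79K repo can be loaded by the transformers CLIP classes, which is not guaranteed for every open_clip checkpoint):
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Assumption: this checkpoint ships transformers-compatible config/tokenizer files.
model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_model = CLIPTextModel.from_pretrained(model_id)

inputs = tokenizer(["cat"], padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    outputs = text_model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: the embedding-layer output plus the output of every
# transformer layer, each of shape [batch, 77, hidden_size], i.e. one vector per token.
print(len(outputs.hidden_states))
print(outputs.hidden_states[-1].shape)  # should be [1, 77, 1024] for this ViT-H-14 text encoder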
Upvotes: 0
Reputation: 126
What you want is the output before the pooling layer, i.e. the embedding for each token of the sequence, which unfortunately is not returned by the open_clip call. What I'm suggesting is to use output.last_hidden_state from the Hugging Face model instead. You can follow this script to do that:
https://gist.github.com/calpt/8e3555bd11f1916b5169c8125117e5ee
I also tried this directly:
CLIPTextModel.from_pretrained("laion/CLIP-ViT-H-14-laion2B-s32B-b79K")
I haven't found anything that says this is doable (and I doubt it is). It loads the weights, but I guess that if something mismatches, it is simply ignored silently.
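If you want to check whether that direct load silently dropped or mismatched anything, here is a rough sketch using the output_loading_info flag of from_pretrained (whether this checkpoint really matches the CLIPTextModel architecture remains the open question):
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# output_loading_info=True returns a dict of missing/unexpected/mismatched keys,
# so a mismatch would be reported here instead of being ignored silently.
model_id = "laion/CLIP-ViT-H-14-laion2B-s32B-b79K"
text_model, loading_info = CLIPTextModel.from_pretrained(model_id, output_loading_info=True)
print(loading_info["missing_keys"], loading_info["mismatched_keys"])

tokenizer = CLIPTokenizer.from_pretrained(model_id)
inputs = tokenizer(["cat"], padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    output = text_model(**inputs)

# One embedding per token position, before pooling.
print(output.last_hidden_state.shape)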
Upvotes: 1