ohhiohhi W

Reputation: 21

Why is the text embedding or image embedding generated by the CLIP model 768 × n?

While trying to understand how CLIP processes its input, I'm confused about the 768. How is a text turned into a 77 × 768 embedding? I know that 77 is the max_length in tokens, which the tokenizer produces from the characters, but I really don't understand how the text ends up with 768 dimensions.

In https://huggingface.co/docs/transformers/model_doc/clip, it describes hidden_size (int, optional, defaults to 768): "Dimensionality of the encoder layers and the pooler layer." So 768 appears there, but I don't know why it is 768, or where I can find the source code in which the dimension becomes 768.

Upvotes: 2

Views: 13483

Answers (4)

Koke Cacao

Reputation: 476

CLIP is not "always [77, 768]".

Each CLIP checkpoint has a text model and an image model. The embedding shape varies between checkpoints, and within one checkpoint the image and text embeddings are also different.


There are 4 CLIP models by OpenAI on Hugging Face, listed here as (image_size, patch_size, image_hidden_size, text_hidden_size, proj_dim):

  • openai/clip-vit-base-patch32 (600M): 224, 32, 768, 512, 512
  • openai/clip-vit-base-patch16 (600M): 224, 16, 768, 512, 512
  • openai/clip-vit-large-patch14 (1.7G): 224, 14, 1024, 768, 768
  • openai/clip-vit-large-patch14-336 (1.7G): 336, 14, 1024, 768, 768
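
If you want to verify these numbers yourself, here is a minimal sketch (assuming the transformers library is installed) that reads each checkpoint's config from the Hugging Face Hub and prints the same columns as the list above:

```python
from transformers import CLIPConfig

for name in [
    "openai/clip-vit-base-patch32",
    "openai/clip-vit-base-patch16",
    "openai/clip-vit-large-patch14",
    "openai/clip-vit-large-patch14-336",
]:
    cfg = CLIPConfig.from_pretrained(name)
    print(
        name,
        cfg.vision_config.image_size,   # image_size
        cfg.vision_config.patch_size,   # patch_size
        cfg.vision_config.hidden_size,  # image_hidden_size
        cfg.text_config.hidden_size,    # text_hidden_size
        cfg.projection_dim,             # proj_dim
    )
```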

Text Model: The number 77 is the maximum number of text tokens (unsure whether it includes the EOT token). Only the single EOT token embedding, of shape [768], is selected for the cosine-similarity calculation. The number of text tokens is the same (77) across the different models.

Image Model: Each CLIP model has a different image-model architecture and thus a different embedding size for images. The number of image tokens is (image_size / patch_size)**2 + 1, where the +1 represents the CLS token.
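
As a quick check of that formula, here is the arithmetic for the four checkpoints above (a sketch using the image_size and patch_size values from the list):

```python
for name, image_size, patch_size in [
    ("openai/clip-vit-base-patch32", 224, 32),
    ("openai/clip-vit-base-patch16", 224, 16),
    ("openai/clip-vit-large-patch14", 224, 14),
    ("openai/clip-vit-large-patch14-336", 336, 14),
]:
    num_tokens = (image_size // patch_size) ** 2 + 1  # patches + CLS token
    print(name, num_tokens)  # 50, 197, 257, 577
```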

CLIP Training: Notice the mismatch between image_hidden_size and text_hidden_size for every model. Here is how it works in clip-vit-large-patch14: (1) After the transformers, images have shape [B, 257, 1024] and texts have shape [B, 77, 768]. (2) Then we select the CLS token for images and the EOT token for texts; the shapes become [B, 1024] and [B, 768]. (3) Each is projected by a linear layer; the shapes become [B, 768] and [B, 768]. (4) Then we dot-product to get a [B, B] matrix of similarities.
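
Here is a minimal sketch (assuming a recent transformers, torch, and Pillow) that traces those shapes for openai/clip-vit-large-patch14; the blank PIL images are just placeholder inputs:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

name = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(name)
processor = CLIPProcessor.from_pretrained(name)

texts = ["a photo of a cat", "a photo of a dog"]
images = [Image.new("RGB", (224, 224)), Image.new("RGB", (224, 224))]

inputs = processor(text=texts, images=images, return_tensors="pt", padding="max_length")
with torch.no_grad():
    out = model(**inputs)

print(out.vision_model_output.last_hidden_state.shape)  # [B, 257, 1024] image tokens
print(out.text_model_output.last_hidden_state.shape)    # [B, 77, 768]   text tokens
print(out.image_embeds.shape)                           # [B, 768] CLS selected, then projected
print(out.text_embeds.shape)                            # [B, 768] EOT selected, then projected
print(out.logits_per_image.shape)                       # [B, B]   similarity matrix
```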

Text Conditioning: Stable Diffusion 1.x/2.x models use all 77 text tokens before pooling. For non-critical tasks, one could in theory use only the pooled EOT token.

Image Conditioning: IPAdapter uses the pooled image embedding (from the last transformer layer), while IPAdapterPlus uses the full image embeddings from the second-to-last transformer layer (possibly because the last layer isn't trained to produce meaningful embeddings for tokens other than the CLS token).


Hope this clears things up!

Upvotes: 0

Giovanni Minelli

Reputation: 126

TLDR: Different models have different sizes of embeddings and 768 is a nice number.

Let's not confuse the CLIPTextModel with the CLIPVisionModel: as you can see from a model's configs, they have different sizes. The text model takes text, which is tokenized (as you said, with max_position_embeddings=77) and then goes through an Embedding layer of vocab_size=49408 × hidden_size=768. Now that you have a (batch, seq, hidden) tensor, you pass it through the encoder (transformer blocks with attention), which does not change the size. Then, to pool away the sequence dimension, you use only the EOS token, obtaining one embedding vector for the whole sequence, and finally a Projection linear layer maps it to the predefined projection_dim, which is the dimension shared with the vision encoder. (Note that I simplified things, e.g. skipping the norm layers.)
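
A minimal sketch of that text path (assuming transformers and torch; using clip-vit-large-patch14, whose text hidden_size and projection_dim are both 768):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_model = CLIPTextModelWithProjection.from_pretrained(name)

tokens = tokenizer(["a photo of a cat"], padding="max_length", return_tensors="pt")
with torch.no_grad():
    out = text_model(**tokens)

print(out.last_hidden_state.shape)  # [1, 77, 768]  (batch, seq, hidden) after the encoder
print(out.text_embeds.shape)        # [1, 768]      EOS token pooled, then projected
```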

The vision model is basically the same, but with its own hidden_size=1024, and instead of a tokenizer it uses only the Embedding layer, which does the job of a patch embedder, producing an embedding vector (of size hidden_size) for each patch. Flatten the patches and you have the (batch, seq, hidden) tensor to pass to the encoder; from there it's the same as above, except that the class token is used instead of the EOS token.
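
And the corresponding sketch for the vision path (same assumptions, plus Pillow), where the patch embedder plus the class token gives 257 tokens of size 1024 before pooling and projection:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

name = "openai/clip-vit-large-patch14"
image_processor = CLIPImageProcessor.from_pretrained(name)
vision_model = CLIPVisionModelWithProjection.from_pretrained(name)

pixels = image_processor(images=Image.new("RGB", (224, 224)), return_tensors="pt")
with torch.no_grad():
    out = vision_model(**pixels)

print(out.last_hidden_state.shape)  # [1, 257, 1024]  (batch, 16*16 patches + class token, hidden)
print(out.image_embeds.shape)       # [1, 768]        class token pooled, then projected
```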

Upvotes: 0

Samuel Anttila

Reputation: 99

Both the image and text encoders rely on per-token embeddings internally, though "token" means something different for the text and image encoders. CLIP is part of a trend of using 'image patch token embeddings'.

This is what the 77x768 is for the text encoder: one 768-dimensional embedding per token.

In the text encoder, the text is tokenized into a series of integers, each representing a unique sequence of characters. Each integer is then mapped directly to a 768-dimensional token embedding through a learned embedding table: every possible vocabulary token has its own representation.

For the image encoder, the image is broken into a grid of small patches (for example 16 × 16 = 256 patches for ViT-L/14; the 77 applies only to text tokens), each receiving its own embedding. The patch-cutting is done with a convolution layer whose stride equals the patch size, so the patches don't overlap. The patches don't exist in the input data directly; the convolutional layer 'cuts the image up' and learns a representation of each patch (this is the per-patch embedding).

In both encoders, the token or patch embedding is then combined with a positional embedding, which encodes where in the input sequence the token or patch came from.

So how does a whole sequence of embeddings end up as a single 768-dimensional vector? Pooling and projection. First, one token is selected per sequence: the EOT token for text and the CLS token for the image, so 77x768 becomes 1x768 (and 257x1024 becomes 1x1024 for the ViT-L/14 image encoder). Then each encoder has a learned projection matrix of shape (transformer width x embedding dimension), where the transformer width is that encoder's hidden size (768 for text, 1024 for vision in ViT-L/14); multiplying the selected token by it yields the final 1x768 embedding.

Once that is done, the output is 1x768 for both of them, and this is how they train the model: by ensuring the 1x768 embedding of the image and the 1x768 embedding of the text of a matching text-image pair are close together.

Image projection - https://github.com/openai/CLIP/blob/main/clip/model.py#L221

Text projection - https://github.com/openai/CLIP/blob/main/clip/model.py#L294
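
Here is a small PyTorch sketch (not the actual CLIP source, just ViT-L/14-like shapes with random tensors) of what those linked lines do: select one token per sequence, then multiply by the learned projection matrix:

```python
import torch

B = 4                                               # batch size
vision_width, text_width, embed_dim = 1024, 768, 768

image_tokens = torch.randn(B, 257, vision_width)    # 256 patches + CLS, from the vision transformer
text_tokens = torch.randn(B, 77, text_width)        # 77 text tokens, from the text transformer
eot_index = torch.randint(1, 77, (B,))              # position of each sequence's EOT token

visual_projection = torch.randn(vision_width, embed_dim)  # learned during training
text_projection = torch.randn(text_width, embed_dim)      # learned during training

image_embeds = image_tokens[:, 0, :] @ visual_projection                 # CLS token -> [B, 768]
text_embeds = text_tokens[torch.arange(B), eot_index] @ text_projection  # EOT token -> [B, 768]

print(image_embeds.shape, text_embeds.shape)  # torch.Size([4, 768]) torch.Size([4, 768])
```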

In short, the 77x768 text embeddings (and the image patch embeddings) become 1x768 by selecting a single token and matrix-multiplying it with a projection matrix of size (width x 768), which is learned along with the rest of the model.

Please note that I glossed over which axis holds which exact dimension in the shapes above; what I'm trying to convey is that it's this final selection-plus-projection that reduces everything down to one vector. For those reading this without knowledge of linear algebra and matrices: matrix multiplication is very different from ordinary multiplication.

Some implementations, like Hugging Face Transformers, split this into a pooling step (whose result is exposed as the 'pooler output') followed by a separate projection layer, while the original implementation just calls it the projection layer, so that's what I've called it here. Either way the overall function is the same.

Upvotes: 1

Mahdi

Reputation: 61

The 768 comes from the embedding of the ViT used by CLIP. ViT splits the input image of 224 * 224 pixels into patches of 16 * 16 pixels. So when you embed a patch (flatten it and pass it through an MLP), its size is 16 * 16 * 3 (RGB) = 768. For the text encoder, to match the image embedding, they use 768 as well so the pair-wise similarity in CLIP can be computed.
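
That arithmetic can be checked in a few lines (a sketch, not CLIP code): cutting a 224 * 224 RGB image into 16 * 16 patches and flattening each one gives 768 values per patch:

```python
import torch

image = torch.randn(3, 224, 224)                     # one RGB image (channels, height, width)
patches = image.unfold(1, 16, 16).unfold(2, 16, 16)  # -> [3, 14, 14, 16, 16]
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 16 * 16 * 3)
print(patches.shape)  # torch.Size([196, 768]): 196 patches, each flattened to 16*16*3 = 768 values
```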

Upvotes: 6
