Reputation: 21
When I tried to understand how CLIP processes text, I got confused about the 768: how is a text turned into a 77 × 768 embedding? I know that 77 is the max_length in tokens, which the tokenizer produces from the input characters, but I really don't understand how the text ends up with 768 dimensions.
In https://huggingface.co/docs/transformers/model_doc/clip it describes hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer, as shown below. But I don't know why it is 768, or where I can find the source code where the dimension is changed to 768.
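For context, this is roughly how the value can be read from a checkpoint's config (a minimal check with the transformers library; I'm assuming `openai/clip-vit-large-patch14`, since that is the checkpoint whose text encoder is 77 × 768):

```python
# Minimal config check (assuming the Hugging Face transformers library and the
# openai/clip-vit-large-patch14 checkpoint, whose text encoder is 77 x 768).
from transformers import CLIPConfig

config = CLIPConfig.from_pretrained("openai/clip-vit-large-patch14")
print(config.text_config.hidden_size)              # 768  -> width of each text token embedding
print(config.text_config.max_position_embeddings)  # 77   -> maximum number of text tokens
print(config.vision_config.hidden_size)            # 1024 -> width of each image patch embedding
```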
Upvotes: 2
Views: 13483
Reputation: 476
CLIP is not "always `[77, 768]`". Each CLIP has a `text-model` and an `image-model`. The embedding shape for each model varies. Also, embeddings for image and text are different.

There are 4 CLIP models by OpenAI on Huggingface, listed as (`image_size`, `patch_size`, `image_hidden_size`, `text_hidden_size`, `proj_dim`):

- `openai/clip-vit-base-patch32` (600M): 224, 32, 768, 512, 512
- `openai/clip-vit-base-patch16` (600M): 224, 16, 768, 512, 512
- `openai/clip-vit-large-patch14` (1.7G): 224, 14, 1024, 768, 768
- `openai/clip-vit-large-patch14-336` (1.7G): 336, 14, 1024, 768, 768

Text Model: The number `77` is related to the maximum number of text tokens (unsure whether it includes the `EOT` token). Only one `EOT` token, of shape `[768]`, is selected for the cosine-similarity calculation. The number of tokens for the `text-model` is the same `77` across the different models.
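Here is a minimal sketch of the text side with the transformers library (assuming the `openai/clip-vit-large-patch14` checkpoint, whose text hidden size is 768):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(name)
text_model = CLIPTextModel.from_pretrained(name)

# Tokenize and pad to the 77-token maximum length.
tokens = tokenizer(["a photo of a cat"], padding="max_length",
                   max_length=77, return_tensors="pt")

with torch.no_grad():
    out = text_model(**tokens)

print(out.last_hidden_state.shape)  # torch.Size([1, 77, 768]) -> one 768-dim vector per token
print(out.pooler_output.shape)      # torch.Size([1, 768])     -> the single EOT token's vector
```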
Image Model: Each CLIP model has a different `image-model` architecture and thus a different embedding size for images. The number of tokens for the `image-model` is `(image_size / patch_size)**2 + 1`, where the `+1` represents the `CLS` token.
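Those token counts can be reproduced directly from each checkpoint's config; a rough sketch (assuming the transformers library):

```python
from transformers import CLIPConfig

checkpoints = [
    "openai/clip-vit-base-patch32",
    "openai/clip-vit-base-patch16",
    "openai/clip-vit-large-patch14",
    "openai/clip-vit-large-patch14-336",
]

for name in checkpoints:
    cfg = CLIPConfig.from_pretrained(name)
    v, t = cfg.vision_config, cfg.text_config
    image_tokens = (v.image_size // v.patch_size) ** 2 + 1  # +1 for the CLS token
    print(f"{name}: image tokens={image_tokens}, image hidden={v.hidden_size}, "
          f"text tokens={t.max_position_embeddings}, text hidden={t.hidden_size}, "
          f"proj dim={cfg.projection_dim}")
```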
CLIP Training: Notice the mismatch between `image_hidden_size` and `text_hidden_size` for every model. Here is how it works in `clip-vit-large-patch14`: (1) After the transformers, images have shape `[B, 257, 1024]` and texts have shape `[B, 77, 768]`. (2) Then we select the `CLS` token for images and the `EOT` token for texts. The shapes become `[B, 1024]` and `[B, 768]`. (3) Each is projected by a linear layer. The shapes become `[B, 768]` and `[B, 768]`. (4) Then we take a dot product to get a `[B, B]` matrix of similarities.
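You can watch those four steps happen through the transformers `CLIPModel`; a sketch (assuming `clip-vit-large-patch14` and a dummy image batch, random values just to show shapes):

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

name = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(name)
tokenizer = CLIPTokenizer.from_pretrained(name)

tokens = tokenizer(["a cat", "a dog"], padding="max_length", max_length=77,
                   return_tensors="pt")
pixel_values = torch.randn(2, 3, 224, 224)  # dummy batch of B=2 "images"

with torch.no_grad():
    out = model(input_ids=tokens.input_ids,
                attention_mask=tokens.attention_mask,
                pixel_values=pixel_values)

print(out.image_embeds.shape)      # torch.Size([2, 768]) -> CLS token, projected 1024 -> 768
print(out.text_embeds.shape)       # torch.Size([2, 768]) -> EOT token, projected 768 -> 768
print(out.logits_per_image.shape)  # torch.Size([2, 2])   -> the [B, B] similarity matrix
```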
Text Conditioning: Stable Diffusion 1.x/2.x models use all 77 text tokens before pooling. For non-critical tasks, one could in theory use only the pooled `EOT` token.
Image Conditioning: IPAdapter uses the pooled image embedding (from the last transformer layer), while IPAdapterPlus uses the full image embeddings from the second-to-last transformer layer (possibly because the last transformer layer isn't meant to produce meaningful outputs for embeddings other than the `CLS` token).
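A sketch of where those two kinds of image embeddings live in the transformers API (assuming `clip-vit-large-patch14` and a dummy image; this only shows the tensors referred to above, not the adapters themselves):

```python
import torch
from transformers import CLIPVisionModel

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
pixel_values = torch.randn(1, 3, 224, 224)  # dummy image, random values just to show shapes

with torch.no_grad():
    out = vision(pixel_values=pixel_values, output_hidden_states=True)

# Pooled embedding: the CLS token of the last layer.
print(out.pooler_output.shape)      # torch.Size([1, 1024])
# Full per-token embeddings from the second-to-last layer.
print(out.hidden_states[-2].shape)  # torch.Size([1, 257, 1024])
```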
Hope this clears things up!
Upvotes: 0
Reputation: 126
TLDR: Different models have different sizes of embeddings and 768 is a nice number.
Let's not confuse the CLIPTextModel with the CLIPVisionModel: as you can see from a model's configs, they have different sizes. The text model takes text, which is tokenized (as you said, with `max_position_embeddings=77`) and then goes through an `Embedding` layer of `vocab_size=49408` x `hidden_size=768`. Now that you've got (batch, seq, hidden), pass it through the encoder (transformer blocks with attention), which does not change the size. Then, to pool away the sequence dimension, you use only the EOS token, obtaining a single embedding vector for the whole sequence, and finally a `Projection` linear layer projects it to the predefined `projection_dim`, which is the dimension size shared with the vision encoder.
(Note that I simplified things, like skipping the norm layers.)
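A sketch of those pieces as they show up in the transformers implementation (assuming `clip-vit-large-patch14`, where both `hidden_size` and `projection_dim` happen to be 768):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

name = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(name)
model = CLIPTextModelWithProjection.from_pretrained(name)

print(model.text_model.embeddings.token_embedding)  # Embedding(49408, 768)
print(model.text_projection)                        # Linear(in_features=768, out_features=768, bias=False)

tokens = tokenizer(["a photo of a cat"], padding="max_length", max_length=77,
                   return_tensors="pt")
with torch.no_grad():
    out = model(**tokens)

print(out.last_hidden_state.shape)  # torch.Size([1, 77, 768]) -> (batch, seq, hidden)
print(out.text_embeds.shape)        # torch.Size([1, 768])     -> pooled EOS token, projected
```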
The vision model is basically the same, but with its own `hidden_size=1024`, and instead of a tokenizer it uses only the `Embedding` layer, which does the job of a patch embedder, obtaining an embedding vector (of size `hidden_size`) for each patch. Flatten the patches and you have the (batch, seq, hidden) to pass to the encoder; from there it's the same as above. Instead of the EOS token, you use the class token there.
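And the same view of the vision side (a sketch assuming `clip-vit-large-patch14` and a dummy image):

```python
import torch
from transformers import CLIPVisionModelWithProjection

model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

emb = model.vision_model.embeddings
print(emb.patch_embedding)        # Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
print(emb.class_embedding.shape)  # torch.Size([1024]) -> the extra class token

pixel_values = torch.randn(1, 3, 224, 224)  # dummy image, random values just to show shapes
with torch.no_grad():
    out = model(pixel_values=pixel_values)

print(out.last_hidden_state.shape)  # torch.Size([1, 257, 1024]) -> 256 patches + class token
print(out.image_embeds.shape)       # torch.Size([1, 768])       -> class token projected to projection_dim
```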
Upvotes: 0
Reputation: 99
Both the image and text encoders rely on embeddings per token internally, though token means something different for the text and image encoders. CLIP is part of a trend of using 'image patch token embeddings'.
This is what the 77x768 is - it's the embedding per token.
In the text encoder, the text is tokenized into a series of integers, each representing a unique series of characters. This is then directly translated into a 768-dimensional token embedding through a learned embedding table: each possible vocabulary token has its own learned representation.
For the image encoder, the image is broken into 77 small patches, each receiving its own encoding. This patch-cutting is done with a convolution layer whose stride equals the patch size, so the patches don't overlap. The patches don't exist in the input data directly, but the convolutional layer 'cuts it up' and then learns a representation of each patch (this is the per-patch embedding).
In both encoders the embedding is then mixed with a positional embedding, which encodes where in the input sequence the token or patch came from.
Now both of them are 77x768: 77 text tokens made into embeddings, or 77 image patches made into embeddings. So how does it end up as 768? Projections. Both the text and image encoding parts of the model have a projection matrix of shape (transformer width x embedding dimensions), which, through matrix multiplication, turns the matrix from 77x768 into 1x768. Transformer width here means how big of an input the transformer can handle, and is 77 in this case.
Once that multiplication is done, the output is 1x768 for both of them, and this is how they train the model: by ensuring the 1x768 embedding of the image and the 1x768 embedding of the text of a text-image pair are close together.
Image projection - https://github.com/openai/CLIP/blob/main/clip/model.py#L221
Text projection - https://github.com/openai/CLIP/blob/main/clip/model.py#L294
In short, the 77x768 in both the vision transformer and the text encoder becomes 1x768 through matrix-multiplying it with a projection matrix of size 77x768, which is learned along with the rest of the model.
Please note I did not stay consistent with which axis had which exact dimension in the matrices above (e.g. technically 77x768 can't be multiplied with 77x768 - one has to be transposed) - but what I'm trying to convey is that it's this final multiplication that reduces it down to one dimension. For those reading this without knowledge of linear algebra and matrices: matrix multiplication is very different from normal multiplication.
Some implementations, like the one in Transformers, call this projection layer the 'pooling layer', and the output of the final projection is called the 'pooler output', but the original implementation calls it the projection layer, so that's what I've called it here. Either way the function is the same.
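For concreteness, here is roughly how those tensors appear through the transformers API (a sketch assuming `clip-vit-large-patch14`; note that in this implementation the pooling step and the projection layer are exposed separately, and the stored projection weight is 768 x 768 for this checkpoint):

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

name = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(name)
tokenizer = CLIPTokenizer.from_pretrained(name)

tokens = tokenizer(["a photo of a cat"], padding="max_length", max_length=77,
                   return_tensors="pt")
with torch.no_grad():
    text_out = model.text_model(**tokens)
    pooled = text_out.pooler_output            # the selected EOT token
    projected = model.text_projection(pooled)  # the learned projection

print(text_out.last_hidden_state.shape)    # torch.Size([1, 77, 768]) -> per-token embeddings
print(pooled.shape)                        # torch.Size([1, 768])     -> "pooler output"
print(model.text_projection.weight.shape)  # torch.Size([768, 768])   -> projection weight
print(projected.shape)                     # torch.Size([1, 768])     -> final text embedding
```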
Upvotes: 1
Reputation: 61
768 comes from the embedding of the ViT used by CLIP. ViT transforms the input image of 224 * 224 pixels into patches of 16 * 16 pixels. Therefore, when you embed the patches (flatten them and use an MLP), each patch has size 16 * 16 * 3 (RGB) = 768. For the text encoder, to match the embedding size of the images, they use 768 as well to calculate the pair-wise similarity in CLIP.
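A tiny sketch of that arithmetic (the patch-flattening view; dummy tensors just to show where 16 * 16 * 3 = 768 comes from):

```python
import torch

patch = torch.randn(3, 16, 16)  # one RGB patch: 3 channels x 16 x 16 pixels
print(patch.flatten().shape)    # torch.Size([768]) -> 16 * 16 * 3 = 768
print((224 // 16) ** 2)         # 196 -> number of such patches in a 224 x 224 image
```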
Upvotes: 6