Kyo

Reputation: 21

Extracting embeddings from CLIP's intermediate layers

I'm experimenting with the CLIP model. I loaded a pretrained model and wanted to see what the embeddings look like at intermediate layers. The code I used is below:

import numpy as np
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm

import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the model first so that `preprocess` exists before building the dataset
model, preprocess = clip.load("ViT-B/32", device=device)

# CelebADataset is my own Dataset wrapper around the aligned CelebA images
dataset = CelebADataset(root_dir="celeba/img_align_celeba", transform=preprocess)
dataloader = DataLoader(dataset, batch_size=32, shuffle=False)

features = {}

def hook_fn(module, input, output):
    # Keep only the embedding at index 0 along the second dimension
    # (meant to pick out the class-token embedding, as I did with DINO)
    features[module] = output[:, 0, :]

# model.patch_embed.register_forward_hook(hook_fn)  # don't hook the patch-embedding layer

# Register the same hook on each of the 12 transformer blocks
for block in model.visual.transformer.resblocks:
    block.register_forward_hook(hook_fn)

all_features = {f'block_{i}': [] for i in range(12)}
# all_features['patch_embed'] = []
all_features['final'] = []
all_labels = []

with torch.no_grad():
    for inputs in tqdm(dataloader):
        inputs = inputs.to(device)
        # The forward pass fires the hooks, which fill `features` as a side effect
        final_output = model.encode_image(inputs)

        # Convert the hooked features to numpy and store them per block
        for i in range(12):
            all_features[f'block_{i}'].append(features[model.visual.transformer.resblocks[i]].cpu().numpy())
        all_features['final'].append(final_output.cpu().numpy())

        # all_labels.append(labels.cpu().numpy())

for key in all_features:
    all_features[key] = np.concatenate(all_features[key], axis=0)
# all_labels = np.concatenate(all_labels, axis=0)

np.save("celeba_block0.npy", all_features[f'block_{0}'])
np.save("celeba_block1.npy", all_features[f'block_{1}'])
...

I had done something similar with DINO before, and after dimensionality reduction on DINO's embeddings I could see images from different label groups forming distinct clusters. With CLIP, however, I don't see clear clusters except in the final embeddings.
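For reference, the reduction step I run on the saved features looks roughly like this (a sketch using scikit-learn's PCA and matplotlib; the label coloring is omitted here, since the loop above doesn't load labels):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

feats = np.load("celeba_block0.npy")
feats = feats.reshape(feats.shape[0], -1)  # flatten, in case the saved array isn't 2-D

# Project to 2-D and plot; with DINO the label groups separate visually here
coords = PCA(n_components=2).fit_transform(feats)
plt.scatter(coords[:, 0], coords[:, 1], s=2)
plt.show()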

Is this because CLIP's network structure is different from DINO's, or is my code wrong?

I tried to look at CLIP's structure, but I couldn't figure out an explanation for this.
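In case it's relevant, this is roughly how I poked at the structure and at what the hooks capture (a sketch reusing `model`, `dataloader`, and `features` from the code above):

print(model.visual)  # prints the vision tower, including the 12 resblocks

sample = next(iter(dataloader)).to(device)
with torch.no_grad():
    _ = model.encode_image(sample)

first_block = model.visual.transformer.resblocks[0]
print(features[first_block].shape)  # what the hook stored for this batch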

Upvotes: 0

Views: 68

Answers (0)
