danywigglebutt

Reputation: 269

Semantic video search

I am attempting to extend OpenAI's CLIP to semantic video search. Essentially, my objective is to input a text query and get back the video segments/clips whose content semantically matches it. Here's what I've come up with so far:

  1. Extract frames from the video at regular intervals.
  2. Use CLIP to create embeddings of these frames and the text query.
  3. Compare the text query embeddings with the video frame embeddings to find matches.
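
For concreteness, here is a rough sketch of steps 2 and 3 as I currently picture them, using the openai/CLIP package (frames stands for the list of PIL images sampled in step 1, and the query text is just a placeholder):

    import torch
    import clip

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model, preprocess = clip.load('ViT-B/32', device=device)

    # frames: PIL images sampled from the video in step 1 (assumed to exist)
    image_input = torch.stack([preprocess(f) for f in frames]).to(device)
    text_input = clip.tokenize(['a dog catching a frisbee']).to(device)  # placeholder query

    with torch.no_grad():
        image_features = model.encode_image(image_input)
        text_features = model.encode_text(text_input)

    # Cosine similarity between the query and every sampled frame
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    scores = (image_features @ text_features.T).squeeze(1)
    best_frame_idx = scores.argmax().item()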

However, this approach seems quite naive, and I suspect it won't capture the context of the videos well, since the temporal information is lost.

Can anyone share advice on improving this approach? Is there a more efficient or effective way to implement semantic video search with OpenAI's CLIP? Also, I'm wondering about any preprocessing steps, possible optimization strategies, or libraries that could be beneficial for this task.

Any help or guidance would be greatly appreciated. Thanks!

Upvotes: 0

Views: 880

Answers (1)

danywigglebutt

Reputation: 269

Here's a simplified step-by-step:

  1. Chunk the Video into 1-second Intervals

    To divide the video into 1-second chunks, you would typically use a library like MoviePy or OpenCV:

    import cv2
    
    video = cv2.VideoCapture('your_video.mp4')
    
    fps = video.get(cv2.CAP_PROP_FPS)
    frames = []
    
    while video.isOpened():
        ret, frame = video.read()
        if ret:
            frames.append(frame)
        else:
            break
    
    video.release()
    
    # Now chunk into 1-second intervals
    chunks = [frames[i:i+int(fps)] for i in range(0, len(frames), int(fps))]
    
  2. Generating the Embeddings

    For each 1-second chunk, the frames are preprocessed and their embeddings are computed with the OpenAI CLIP model.

    import cv2
    import torch
    import clip
    from PIL import Image
    
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model, preprocess = clip.load('ViT-B/32', device=device)
    
    for chunk in chunks:
        # For each frame in the chunk: OpenCV frames are BGR NumPy arrays, while CLIP's
        # preprocess expects RGB PIL images, so convert before preprocessing
        images = [torch.unsqueeze(preprocess(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))), 0)
                  for frame in chunk]
    
        # Stack all tensors together and move them to the model's device
        images_input = torch.cat(images, 0).to(device)
    
        # Generate the embedding
        with torch.no_grad():
            image_features = model.encode_image(images_input)
    
  3. Performing the Search

    You can rank the chunk embeddings against the text-query embedding using cosine similarity (util below comes from the sentence-transformers package):

    from sentence_transformers import util
    
    # Encode the text query with the same CLIP model (example query text)
    with torch.no_grad():
        query_vector = model.encode_text(clip.tokenize(['people experiencing joy']).to(device))
    
    # corpus_vector: one embedding per chunk, stacked into a single tensor
    # docs: an identifier per chunk, e.g. its start time in seconds
    # (see the sketch after this list for one way to build both)
    
    # Calculate cosine similarity between the corpus of chunk vectors and the query vector
    scores = util.cos_sim(query_vector, corpus_vector)[0].cpu().tolist()
    
    # Combine chunk identifiers & scores
    doc_score_pairs = list(zip(docs, scores))
    
    # Sort by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    
    # Output chunk identifiers & scores
    for doc, score in doc_score_pairs:
        print(score, doc)
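
One detail the snippets above leave implicit is where corpus_vector and docs come from. Here is a minimal sketch that mean-pools each chunk's frame embeddings into a single chunk-level vector (mean pooling is my own simplification rather than anything CLIP prescribes, and it still discards motion); it reuses model, preprocess, device and chunks from the snippets above:

    import cv2
    import torch
    from PIL import Image
    
    chunk_vectors = []
    docs = []  # one identifier per chunk; here, the chunk's start time in seconds
    
    for i, chunk in enumerate(chunks):
        # Convert BGR frames to RGB PIL images, preprocess, and batch them
        images_input = torch.cat(
            [torch.unsqueeze(preprocess(Image.fromarray(cv2.cvtColor(f, cv2.COLOR_BGR2RGB))), 0)
             for f in chunk], 0).to(device)
        with torch.no_grad():
            image_features = model.encode_image(images_input)
    
        # Average the per-frame embeddings into one vector per 1-second chunk
        chunk_vectors.append(image_features.mean(dim=0))
        docs.append(i)  # chunk i starts at roughly second i
    
    corpus_vector = torch.stack(chunk_vectors)

With corpus_vector and docs built this way, the cosine-similarity snippet in step 3 returns the best-matching one-second offsets for the query.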

The challenge with this approach, however, is that treating each 1-second interval as a bag of independent frames does not capture the temporal context of the video. Ideally the chunks would be treated as moving images rather than as still frames.

Mixpeek offers a managed search API that does this:

GET: https://api.mixpeek.com/v1/search?q=people+experiencing+joy

Response:

[
  {
    "content_id": "6452f04d4c0c0888bdc6b97c",
    "metadata": {
      "file_ext": "mp4",
      "file_id": "ebc289d7-44e1-4672-bf3c-ccfa490b7k2d",
      "file_url": "https://mixpeek.s3.amazonaws.com/<user>/<file>.mp4",
      "filename": "CR-9146f0.mp4",
    },
    "score": 0.636489987373352,
    "timestamps": [
      2.5035398230088495,
      1.2517699115044247,
      3.755309734513274
    ]
  }
]
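
If you want to call that endpoint from Python, a minimal sketch with the requests library could look like the following (the authentication header is an assumption on my part; check Mixpeek's docs for the actual scheme):

    import requests
    
    # NOTE: the Authorization header below is assumed, not taken from Mixpeek's docs
    resp = requests.get(
        'https://api.mixpeek.com/v1/search',
        params={'q': 'people experiencing joy'},
        headers={'Authorization': 'Bearer YOUR_API_KEY'},
    )
    resp.raise_for_status()
    
    # Each result carries a relevance score, file metadata, and matching timestamps (in seconds)
    for result in resp.json():
        file_url = result['metadata']['file_url']
        for ts in result['timestamps']:
            print(f"{result['score']:.3f}  {file_url} @ {ts:.1f}s")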

Further reading and demo: https://learn.mixpeek.com/what-is-semantic-video-search/

Upvotes: 1
