Reputation: 269
I am attempting to extend OpenAI's CLIP functionality to semantic video search. Essentially, my objective is to input a text query and get relevant video segments/clips that match the semantic content of the text query. Here's what I've thought so far:

1. Split the video into short (e.g. 1-second) chunks of frames.
2. Embed each chunk's frames with CLIP's image encoder.
3. Embed the text query with CLIP's text encoder and rank the chunks by cosine similarity.

However, this approach seems quite naive, and I feel it might not effectively capture the context in the videos, since the temporal information is lost.
Can anyone share advice on improving this approach? Is there a more efficient or effective way to implement semantic video search with OpenAI's CLIP? Also, I'm wondering about any preprocessing steps, possible optimization strategies, or libraries that could be beneficial for this task.
Any help or guidance would be greatly appreciated. Thanks!
Upvotes: 0
Views: 880
Reputation: 269
Here's a simplified step-by-step:
Chunk the Video into 1-second Intervals
To divide the video into 1-second chunks, you would typically use a library like moviepy or opencv. An OpenCV version is shown below, with a moviepy alternative after it.
import cv2

video = cv2.VideoCapture('your_video.mp4')
fps = video.get(cv2.CAP_PROP_FPS)

# Read every frame into memory (fine for short videos)
frames = []
while video.isOpened():
    ret, frame = video.read()
    if not ret:
        break
    frames.append(frame)
video.release()

# Now chunk into 1-second intervals (fps frames per chunk)
chunks = [frames[i:i + int(fps)] for i in range(0, len(frames), int(fps))]
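If you prefer moviepy, here is an equivalent sketch (assuming moviepy 1.x's moviepy.editor import path; note that moviepy yields RGB frames, whereas OpenCV yields BGR):

from moviepy.editor import VideoFileClip

video_clip = VideoFileClip('your_video.mp4')
fps = video_clip.fps
# iter_frames() yields each frame as an RGB numpy array,
# so skip the BGR-to-RGB conversion used in the CLIP step below
frames = list(video_clip.iter_frames())
chunks = [frames[i:i + int(fps)] for i in range(0, len(frames), int(fps))]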
Generating the Embeddings
Each 1-second chunk is a series of frames. Each frame is embedded with the OpenAI CLIP image encoder, and the per-frame embeddings are averaged into a single vector per chunk so that every chunk can later be compared against the query.
import torch
import clip
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

chunk_embeddings = []
for chunk in chunks:
    # CLIP's preprocess expects PIL images; OpenCV frames are BGR numpy arrays
    images = [preprocess(Image.fromarray(cv2.cvtColor(f, cv2.COLOR_BGR2RGB))).unsqueeze(0) for f in chunk]
    images_input = torch.cat(images, 0).to(device)
    # Generate one embedding per frame, then average into a single vector per chunk
    with torch.no_grad():
        frame_features = model.encode_image(images_input)
    chunk_embeddings.append(frame_features.mean(dim=0, keepdim=True))
Performing the Search
Embed the text query with CLIP's text encoder, then rank the chunk embeddings against it using cosine similarity (util.cos_sim here comes from the sentence-transformers package):
from sentence_transformers import util

# Embed the text query with CLIP's text encoder
text_tokens = clip.tokenize(['people experiencing joy']).to(device)
with torch.no_grad():
    query_vector = model.encode_text(text_tokens)

# Stack the per-chunk embeddings and score them against the query by cosine similarity
corpus_vectors = torch.cat(chunk_embeddings, 0)
scores = util.cos_sim(query_vector, corpus_vectors)[0].cpu().tolist()

# Sort chunk indices (each index is the chunk's start time in seconds) by decreasing score
chunk_score_pairs = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Output chunk start times & scores
for chunk_idx, score in chunk_score_pairs:
    print(f'{chunk_idx}s', score)
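Since each chunk index corresponds to a start time in whole seconds, you can cut the best-matching segment straight from the frames already in memory. A minimal sketch with OpenCV's VideoWriter (the output filename and mp4v codec are arbitrary choices; this assumes the BGR frames from the OpenCV chunking step):

# Write the best-scoring 1-second chunk out as its own clip
best_idx = chunk_score_pairs[0][0]
best_chunk = chunks[best_idx]
h, w = best_chunk[0].shape[:2]
writer = cv2.VideoWriter('best_match.mp4', cv2.VideoWriter_fourcc(*'mp4v'), fps, (w, h))
for frame in best_chunk:
    writer.write(frame)
writer.release()
print(f'Best match starts at ~{best_idx}s')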
The challenge with this approach, however, is that treating each 1-second interval as a bag of independent frames does not capture the temporal context of the video; the chunks really need to be treated as moving images.
Mixpeek offers a managed search API that does this:
GET: https://api.mixpeek.com/v1/search?q=people+experiencing+joy
Response:
[
  {
    "content_id": "6452f04d4c0c0888bdc6b97c",
    "metadata": {
      "file_ext": "mp4",
      "file_id": "ebc289d7-44e1-4672-bf3c-ccfa490b7k2d",
      "file_url": "https://mixpeek.s3.amazonaws.com/<user>/<file>.mp4",
      "filename": "CR-9146f0.mp4"
    },
    "score": 0.636489987373352,
    "timestamps": [
      2.5035398230088495,
      1.2517699115044247,
      3.755309734513274
    ]
  }
]
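Called from Python, that request might look like the following (a sketch based on the example above; the Authorization header name and the API key are placeholders, so check Mixpeek's docs for the exact auth scheme):

import requests

response = requests.get(
    'https://api.mixpeek.com/v1/search',
    params={'q': 'people experiencing joy'},
    headers={'Authorization': 'Bearer YOUR_API_KEY'},  # placeholder credentials
)
for result in response.json():
    print(result['score'], result['metadata']['filename'], result['timestamps'])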
Further reading and demo: https://learn.mixpeek.com/what-is-semantic-video-search/
Upvotes: 1