Implementing Named Entity Recognition using embeddings

I want to use OpenAI's CLIP model to perform Multimodal Named Entity Recognition on an image-text dataset.

I have converted the image-text pairs into embeddings, but how do I perform NER on them now? Or is there a better approach using the CLIP model?

Upvotes: 0