Sentence Transformers v5.4, released this week, adds multimodal embedding and reranking support. You can now encode text, images, audio, and video into a shared embedding space and run similarity search across modalities using the same API the library has offered for text-only retrieval for years. The supported models include Qwen3-VL (2B and 8B variants), NVIDIA Nemotron, and BAAI BGE-VL for embeddings, with matching rerankers available for the same families.

What this changes in practice is the retrieval layer of multimodal RAG pipelines. Previously, building text-to-image or image-to-text search required separate tooling: a vision-language model for encoding, custom similarity infrastructure, and coordination logic for mixed-modality result sets. Sentence Transformers wraps all of that into a single SentenceTransformer model object with the same .encode() and .similarity() interface. Cross-modal search becomes a configuration choice rather than an architecture decision.
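To make the "configuration choice" concrete, here is a minimal sketch of text-to-image search through the familiar interface. The model identifier is a placeholder, not a confirmed Hugging Face id, and the code assumes the multimodal checkpoints accept PIL images through .encode() the way existing CLIP-style models in the library do:

```python
# Sketch: cross-modal search with the standard SentenceTransformer API.
# "org/multimodal-embedding-model" is a placeholder; substitute the actual
# checkpoint name from the Hugging Face hub.

def cross_modal_search(query: str, image_paths: list[str],
                       model_name: str = "org/multimodal-embedding-model"):
    """Embed a text query and an image corpus, return a similarity matrix."""
    from sentence_transformers import SentenceTransformer  # lazy import
    from PIL import Image

    model = SentenceTransformer(model_name)
    query_emb = model.encode([query])                             # text side
    corpus_emb = model.encode([Image.open(p) for p in image_paths])  # image side
    # Same .similarity() call as text-only retrieval (cosine by default).
    return model.similarity(query_emb, corpus_emb)
```

The point is that nothing in the calling code knows it is doing cross-modal retrieval; only the model choice changes.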

The retrieve-and-rerank pattern, which is standard for text retrieval — fast embedding-based first-pass, accurate cross-encoder reranking on the top-k — now extends to mixed modalities. A query can be text, the corpus can be a mix of images, documents, and video clips, and the reranker can score relevance across all of them. The 2B Qwen3-VL model requires around 8 GB VRAM, which puts it within reach of most developer workstations. The 8B model needs 20 GB but gives higher accuracy.
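A sketch of the two-stage pipeline over a mixed corpus follows. The model names are placeholders, and it assumes the multimodal rerankers expose the library's usual CrossEncoder .predict() interface on (query, candidate) pairs; verify that against the release examples before relying on it:

```python
# Sketch: embedding-based first pass, cross-encoder second pass on the top-k.
# Model names below are placeholders, not confirmed checkpoint ids.

def retrieve_and_rerank(query, corpus, embedder_name, reranker_name, top_k=10):
    from sentence_transformers import SentenceTransformer, CrossEncoder

    # First pass: cheap bi-encoder scoring over the whole corpus.
    embedder = SentenceTransformer(embedder_name)
    scores = embedder.similarity(embedder.encode([query]),
                                 embedder.encode(corpus))[0]
    candidates = sorted(range(len(corpus)),
                        key=lambda i: float(scores[i]), reverse=True)[:top_k]

    # Second pass: accurate but slow cross-encoder on the survivors only.
    reranker = CrossEncoder(reranker_name)
    rerank_scores = reranker.predict([(query, corpus[i]) for i in candidates])
    ranked = sorted(zip(candidates, rerank_scores),
                    key=lambda pair: pair[1], reverse=True)
    return [(corpus[i], float(s)) for i, s in ranked]
```

The shape of the pipeline is identical to text-only retrieve-and-rerank; only the corpus contents and the model choices differ.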

One thing practitioners should know before diving in: the modality gap is real. Cross-modal similarity scores (text query, image corpus) are systematically lower than within-modal scores (text query, text corpus), because embeddings from different modalities cluster separately in the shared space. Absolute score thresholds that work for text-only retrieval will not transfer directly. The relative ordering holds, so retrieval works, but any routing logic that gates on a minimum similarity score needs recalibration.
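One pragmatic response to the gap is to calibrate a separate threshold per modality pair instead of reusing a single global cutoff. A minimal sketch, using made-up calibration scores purely for illustration:

```python
import statistics

def per_modality_thresholds(calibration: dict[str, list[float]],
                            n_sigma: float = 1.0) -> dict[str, float]:
    """Derive one gating threshold per modality pair from held-out
    calibration scores: mean minus n_sigma standard deviations."""
    return {
        pair: statistics.mean(s) - n_sigma * statistics.stdev(s)
        for pair, s in calibration.items()
    }

# Illustrative (invented) numbers: cross-modal similarities run
# systematically lower than within-modal ones, as described above.
calib = {
    "text->text":  [0.62, 0.71, 0.66, 0.69, 0.64],
    "text->image": [0.28, 0.33, 0.25, 0.31, 0.30],
}
thresholds = per_modality_thresholds(calib)
# A global cutoff tuned on text (say 0.5) would reject every image hit,
# even though the image ranking itself is sound.
```

Any routing logic then gates each candidate against the threshold for its own modality pair, leaving the relative ordering untouched.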

Install with pip install -U "sentence-transformers[image,video]". The Qwen3-VL models are currently in pull-request status on Hugging Face (referenced as revision="refs/pr/23" in the examples), which means the integration is functional but the model repositories have not yet been merged to their main branches. Worth testing now if cross-modal retrieval is on your roadmap — but pin the revision rather than tracking main if you are putting this in production.
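Pinning looks like the following sketch. The model id is a placeholder; revision="refs/pr/23" is the value from the release examples, and the revision keyword is the standard Hugging Face Hub mechanism the library passes through:

```python
# Sketch: pin the pull-request revision so a later merge or force-push
# cannot silently change the weights underneath a deployed system.
# "org/qwen3-vl-embedding" is a placeholder, not a confirmed model id.

def load_pinned(model_name: str = "org/qwen3-vl-embedding"):
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(
        model_name,
        revision="refs/pr/23",  # from the release examples; do not track main
    )
```

Once the pull requests are merged, switching to a tagged commit hash on the main branch is the safer long-term pin.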