Multi-modal and Cross-modal Search in Jina


This guide assumes you have a basic understanding of Jina. If you don't, please check out Jina 101 first.

Jina is a data type-agnostic framework: it lets you work with any type of data and run cross-modal and multi-modal search Flows. To see what this implies, we first need to understand the concept of modality.

Feature description

You may think that different modalities correspond to different kinds of data (images and text, for example). However, this is not accurate. For example, you can do cross-modal search by searching images taken from different perspectives, or by finding the title that matches a given paragraph of text. In both cases the data type is the same on each side of the search. We can therefore consider a modality to correspond to a given data distribution from which input may come.

For this reason, and to provide first-class support for cross-modal and multi-modal search, Jina offers modality as an attribute of its Document primitive type. Now that we have agreed on the concept of modality, we can describe cross-modal and multi-modal search.
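To make the idea of a modality attribute concrete, here is a minimal sketch in plain Python. The `ToyDocument` class is a hypothetical stand-in written for this illustration; it is not Jina's Document class, and the modality names are assumptions. It shows two modalities (`title` and `paragraph`) that share the same data type (text).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a document; not Jina's Document class."""
    content: str
    modality: str  # names the data distribution, not the data type


# All three documents are text, yet they come from two different modalities.
docs = [
    ToyDocument("Cross-modal retrieval in practice", modality="title"),
    ToyDocument("This article walks through a retrieval setup.", modality="paragraph"),
    ToyDocument("A survey of neural search", modality="title"),
]

# Group documents by modality so each group could be sent to its own encoder.
by_modality = defaultdict(list)
for doc in docs:
    by_modality[doc.modality].append(doc)

modality_counts = {name: len(group) for name, group in by_modality.items()}
print(modality_counts)
```

Grouping by modality rather than by data type is what would allow titles and paragraphs to be encoded by different models even though both are text.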

  • Cross-modal search can be defined as a set of retrieval applications that try to effectively find relevant documents of modality A by querying with documents from modality B.

  • Multi-modal search can be defined as a set of retrieval applications that can leverage multiple modalities at query time.

The main difference between these two search modes is that in cross-modal search there is a direct mapping between a single document and a vector in embedding space, while in multi-modal search this does not hold, since two or more documents might be combined into a single vector.
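The difference in how documents map to vectors can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the embedding values are made up, the function names `cross_modal_search` and `fuse` are hypothetical, and averaging is only one of several possible ways to combine vectors. None of it is Jina's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy image embeddings in a shared 3-dimensional space (made-up values).
image_index = {
    "img_cat": [0.9, 0.1, 0.0],
    "img_car": [0.0, 0.2, 0.9],
}


def cross_modal_search(query_vec, index):
    """Return the id of the indexed document most similar to the query vector."""
    return max(index, key=lambda doc_id: cosine(query_vec, index[doc_id]))


def fuse(vectors):
    """Combine several documents' vectors into one by element-wise averaging."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Cross-modal: one text document maps to one vector and retrieves an image.
text_query = [0.8, 0.2, 0.1]
best_image = cross_modal_search(text_query, image_index)

# Multi-modal: a text document and a sketch document become a single query vector.
sketch_query = [0.1, 0.1, 0.9]
fused_query = fuse([text_query, sketch_query])
best_for_fused = cross_modal_search(fused_query, image_index)

print(best_image, fused_query, best_for_fused)
```

In the cross-modal case each document keeps its own vector; in the multi-modal case the fused vector no longer corresponds to any single input document, which is the distinction described above.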

Supporting both search modes unlocks many powerful patterns and keeps Jina flexible and agnostic to what can be searched.

What’s Next

Thanks for taking the time to read this documentation. Head over to the example projects and start getting your hands dirty!

If you still have questions, feel free to submit an issue or post a message in our community Slack channel.