Multi & Cross Modality

Jina is a data type-agnostic framework, letting you work with any type of data and run cross- and multi-modal search Flows.

To better understand what this implies, we first need to understand the concept of modality.

One may think that different modalities correspond to different kinds of data, such as images and text. However, this is not accurate. For example, you can do cross-modal search by searching images taken from different perspectives, or by searching for titles that match a given paragraph of text. A modality is therefore better understood as the data distribution from which an input comes.
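To make this concrete, here is a minimal sketch in plain Python. The `Doc` class and the `group_by_modality` helper are hypothetical stand-ins written for illustration and are not part of Jina's API. The sketch shows documents that share one data type (text) but come from two different distributions, titles and paragraphs, and are therefore treated as two modalities.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not Jina's Document class."""
    content: str
    modality: str  # names the data distribution, not the data type


def group_by_modality(docs):
    """Bucket document contents by their modality label."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.modality].append(doc.content)
    return dict(groups)


# All three documents are text, yet they fall into two modalities.
docs = [
    Doc("A Short History of Bridges", modality="title"),
    Doc("Bridges have been built since antiquity.", modality="paragraph"),
    Doc("Suspension Cables Explained", modality="title"),
]
groups = group_by_modality(docs)
```

Here `groups` has two keys, `"title"` and `"paragraph"`, even though every document holds a string.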

For this reason, and to have first-class support for cross- and multi-modal search, Jina offers modality as an attribute in its Document primitive type. Now that we have agreed on the concept of modality, we can describe cross-modal and multi-modal search.

  • Cross-modal search can be defined as a set of retrieval applications that try to effectively find relevant documents of modality A by querying with documents from modality B.

  • Multi-modal search can be defined as a set of retrieval applications that try to effectively project documents of different modalities into a common embedding space, and find relevant documents with respect to the fusion of multiple modalities.

The main difference between these two search modes is that for cross-modal search, there is a direct mapping between a single document or chunk and a vector in embedding space, while for multi-modal search this does not hold true, since two or more documents might be combined into a single vector.
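The difference can be sketched with toy vectors in plain Python. Everything here is an assumption made for illustration: the embeddings are hand-written rather than produced by a model, the `fuse` helper uses an element-wise mean as one possible fusion, and none of the names belong to Jina's API. In the cross-modal case each indexed document maps to exactly one vector. In the multi-modal case two documents (an image and its caption) are combined into a single vector before indexing.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def fuse(vectors):
    """Toy fusion (assumption): element-wise mean of several vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def best_match(query, index):
    """Return the key whose vector is most similar to the query."""
    return max(index, key=lambda key: cosine(query, index[key]))


# Cross-modal: one vector per image, queried with one text vector.
image_index = {"img_cat": [0.9, 0.1], "img_car": [0.1, 0.9]}
text_query = [0.8, 0.2]
cross_hit = best_match(text_query, image_index)

# Multi-modal: each entry fuses an image vector and a caption vector,
# so two documents end up as a single vector in the index.
multi_index = {
    "item_cat": fuse([[0.9, 0.1], [0.7, 0.3]]),
    "item_car": fuse([[0.1, 0.9], [0.3, 0.7]]),
}
multi_query = fuse([[0.2, 0.8], [0.2, 0.8]])
multi_hit = best_match(multi_query, multi_index)
```

With these toy values the text query retrieves `img_cat`, and the fused query retrieves `item_car`. A real system would replace the hand-written vectors with encoder outputs and might use a learned fusion instead of a mean.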

Supporting both search modes enables many powerful patterns and keeps Jina flexible and agnostic to what can be searched.