What is Cross-Modal & Multi-Modal?#

Jina is a framework for building cross-modal and multi-modal applications on the cloud. But first, what do cross-modal and multi-modal mean, and what are their applications? This chapter answers these preliminary questions.

A video version of this chapter is available below.

Beyond single modality#

The term “modal” is shorthand for “data modality”, which can be thought of as the “type” of data. For example, a tweet has the modality “text”; a photo has the modality “image”; a video has the modality “video”; and so on.

In the early days of AI, research was focused on a single modality, such as vision or language. For example, a spam filter is focused on text modality. A photo classifier is focused on image modality. A music recommender is focused on audio modality. However, it soon became clear that in order to create truly intelligent systems, AI must be able to integrate multiple modalities. In the real world, data is often multimodal, meaning that it consists of multiple modalities. For example, a tweet often contains not only text, but also images, videos, and links. A video often contains not only video frames, but also audio and text (e.g. subtitles). This has led to the development of cross-modality and multi-modality in AI.

Multi-modal machine learning is a relatively new field that is concerned with the development of algorithms that can learn from multiple modalities of data.

Cross-modal machine learning is a subfield of multi-modal machine learning that is concerned with the development of algorithms that can learn from multiple modalities of data that are not necessarily aligned. For example, learning from images and text where the images and text are not necessarily about the same thing.

Thanks to recent advances in deep neural networks, cross-modal or multi-modal technologies enable advanced intelligence on all kinds of unstructured data, such as images, audio, video, PDF, 3D meshes, and more.

Cross-modality and multi-modality are two terms that are often used interchangeably, but there is a big difference between the two. Multi-modality refers to the ability of a system to use multiple modalities, or input channels, to achieve a desired goal. For example, a human can use both sight and hearing to identify a person or object. In contrast, cross-modality refers to the ability of a system to use information from one modality to improve performance in another modality. For example, if you see a picture of a dog, you might be able to identify it by its bark when you hear it.

AI systems that are designed to work with multiple modalities are said to be “multi-modal.” However, the term “cross-modality” is more accurate when referring to AI systems that use information from one modality to improve performance in another.

In general, cross-modal and multi-modal technologies allow for a more holistic understanding of data, as well as increased accuracy and efficiency.

Applications#

There are many potential applications of cross-modal and multi-modal machine learning. For example, a cross-modal machine learning algorithm could be used to automatically generate descriptions of images (e.g. for blind people). A search system could use a cross-modal machine learning algorithm to search for images by text queries (e.g. “find me a picture of a dog”). A text-to-image generation system could use a cross-modal machine learning algorithm to generate images from text descriptions (e.g. “generate an image of a dog”).

Cross-modal AI systems have the potential to greatly improve the performance of AI systems by making them more flexible and robust. For example, a cross-modal system could be used to improve the accuracy of facial recognition algorithms by using information from other modalities such as body language or voice. Another potential application is using information from one modality to compensate for the limitations of another. For example, if an image recognition algorithm is having difficulty identifying an object due to poor lighting conditions, information from another modality such as sound could be used to help identify the object.
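
One simple way to let one modality compensate for another is late fusion: each modality produces its own per-class scores, and the scores are combined by a weighted average. The sketch below illustrates the poor-lighting example above. The function name `fuse_scores`, the labels, and all score values are made up for illustration; real systems may combine modalities in very different ways.

```python
def fuse_scores(image_scores, audio_scores, image_weight=0.5):
    """Late fusion: weighted average of per-class scores from two modalities."""
    audio_weight = 1.0 - image_weight
    # Only fuse labels that both modalities know about.
    labels = image_scores.keys() & audio_scores.keys()
    return {
        label: image_weight * image_scores[label] + audio_weight * audio_scores[label]
        for label in labels
    }


# Hypothetical model outputs (not produced by any real model).
image_scores = {"dog": 0.40, "cat": 0.45, "fox": 0.15}  # poor lighting: nearly a tie
audio_scores = {"dog": 0.90, "cat": 0.05, "fox": 0.05}  # a clearly audible bark

fused = fuse_scores(image_scores, audio_scores)

best_image_only = max(image_scores, key=image_scores.get)
best_fused = max(fused, key=fused.get)

print("image only:", best_image_only)
print("image + audio:", best_fused)
```

With these toy numbers the image alone slightly favors the wrong label, while the fused scores favor the label that the audio supports.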

Under this big umbrella sit two families of applications: neural search and creative AI.

Creative AI#

Another potential application of cross-modal machine learning is creative AI. Creative AI systems use artificial intelligence to generate new content, such as images, videos, or text. For example, OpenAI’s GPT-3 is a language model that can generate text. The model is trained on a large corpus of text, such as books, articles, and websites. Once trained, it can generate new text that resembles the training data, which can be used to write new articles, stories, or even poems.

OpenAI’s DALL·E is another example of a creative AI system. This system generates images from textual descriptions. For example, given the text “a black cat with green eyes”, the system will generate an image of a black cat with green eyes. Below is an example of generating images from a text prompt using DALL·E Flow (a text-to-image system built on top of Jina).

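# Assumes the legacy docarray API (versions before 0.30), where Document.post exists
from docarray import Document

server_url = 'grpc://dalle-flow.jina.ai:51005'
prompt = 'an oil painting of a humanoid robot playing chess in the style of Matisse'

# Send the prompt to the DALL·E Flow server and request 8 candidate images
doc = Document(text=prompt).post(server_url, parameters={'num_images': 8})

# The generated images are returned as matches of the prompt document
da = doc.matches

# Plot all candidates in a single grid, labelled with their index
da.plot_image_sprites(fig_size=(10, 10), show_index=True)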
https://github.com/jina-ai/dalle-flow/raw/main/.github/client-dalle.png?raw=true

Creative AI has the potential to change how we interact with machines and to help us create more personalized experiences. For example, it could:

  • Create realistic 3D images and videos of people and objects, which can be used in movies, video games, and other visual media.

  • Generate realistic and natural-sounding dialogue, which can be used in movies, video games, and other forms of entertainment.

  • Create new and innovative designs for products, which can be used in manufacturing and other industries.

  • Create new and innovative marketing campaigns, which can be used in advertising and other industries.

Relationship is the key#

So what ties neural search and creative AI together?

The “relationship” between or within modalities.

What is this “relationship” we are talking about? Consider the following illustration, in which the texts “cat”, “dog”, “human”, and “ape” and their images are all represented in one embedding space:

../../_images/relationship.svg

The “relationship” encodes the following information:

  • The text embedding of “cat” is close to the text embedding of “dog” (same modality);

  • The text embedding of “human” is close to the text embedding of “ape” (same modality);

  • The text embedding of “cat” is far from the text embedding of “human” (same modality);

  • The text embedding of “cat” is close to the image embedding of “cat” (different modality);

  • The image embedding of “cat” is close to the image embedding of “dog” (same modality);

  • etc.
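
The relationships listed above can be sketched with a few toy vectors and cosine similarity, one common way to measure closeness in an embedding space. The 3-dimensional vectors below are hand-picked for illustration and are not the output of any real model; the helper names `cosine_similarity` and `sim` are likewise just for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means 'close'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings keyed by (modality, concept).
# Text and image vectors live in the same space, so they can be compared directly.
embeddings = {
    ("text", "cat"): [0.90, 0.10, 0.20],
    ("text", "dog"): [0.80, 0.20, 0.30],
    ("text", "human"): [0.10, 0.90, 0.30],
    ("text", "ape"): [0.20, 0.80, 0.40],
    ("image", "cat"): [0.85, 0.15, 0.10],
    ("image", "dog"): [0.75, 0.25, 0.20],
}


def sim(key_a, key_b):
    """Similarity between two entries of the toy embedding table."""
    return cosine_similarity(embeddings[key_a], embeddings[key_b])


pairs = [
    (("text", "cat"), ("text", "dog")),
    (("text", "human"), ("text", "ape")),
    (("text", "cat"), ("text", "human")),
    (("text", "cat"), ("image", "cat")),
    (("image", "cat"), ("image", "dog")),
]
for a, b in pairs:
    print(a, b, round(sim(a, b), 3))
```

In this toy table, every "close" pair from the list scores much higher than the "far" pair of “cat” and “human”, both within a modality and across modalities.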

Don’t underestimate the power of this relationship. It is the foundation of neural search and creative AI. It is like the DNA of a species. Once mastered, it can be used to find the closest match to any other species, and create new species!

../../_images/dna.png

In summary, the key to cross-modal and multi-modal applications is understanding the relationship between modalities. This relationship can be used to find existing data, which is neural search, or to make new data, which is creative AI.
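
To make the "find existing data" half concrete, here is a minimal sketch of searching images with a text query as a nearest-neighbor lookup in a shared embedding space. Everything in it is an assumption for illustration: the file names, the hand-picked vectors (standing in for the output of real text and image encoders), and the `search` helper. It does not show how Jina itself implements search.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means 'close'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical image index: file name -> image embedding.
image_index = {
    "cat.jpg": [0.85, 0.15, 0.10],
    "dog.jpg": [0.75, 0.25, 0.20],
    "human.jpg": [0.15, 0.85, 0.30],
    "ape.jpg": [0.25, 0.75, 0.40],
}

# Hypothetical text embeddings in the same space as the images.
text_embeddings = {
    "cat": [0.90, 0.10, 0.20],
    "human": [0.10, 0.90, 0.30],
}


def search(query_embedding, index, top_k=2):
    """Return the names of the top_k indexed items most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), name)
        for name, embedding in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


for query in ("cat", "human"):
    print(query, "->", search(text_embeddings[query], image_index))
```

A brute-force scan like this is only practical for small collections; larger indexes typically rely on approximate nearest-neighbor methods instead.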

In the next chapter, we will see how Jina is the ideal tool for building cross-modal and multi-modal applications on the cloud.