In Jina 101, you will learn about the basic components of Jina. Before you start, read [the basics of Neural Search](https://github.com/jina-ai/jina/blob/master/.github/pages/hello-world.md) to get some context and run Jina’s Hello World demos to see Jina in action.
Documents are pieces of data in any dataset you want to search, and the input queries you use to find what you want. Basically, they are the input and output data for your search workflows.
A Document is a primitive data type in Jina. You can compare Jina’s Document to NumPy’s ndarray or TensorFlow’s tensor. It is also a powerful container for multimedia data: you can store text, images, audio, or an array inside a Document. A good example is the text we type into a search engine like Google. Documents can also be video clips, genetic sequences, MP3 files, scientific papers, funny GIFs, or many other things.
A Document can be recursive in two directions. In one direction, a Document has its nearest neighbors, and those neighbors can have their own nearest neighbors. Jina calls these high-order matches. In the other direction, a Document can be divided into sub-documents, and each sub-document can be segmented further into finer granularities. This rich structure allows Jina to represent complex documents with hierarchies and multiple modalities.
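To make the two directions concrete, here is a minimal plain-Python sketch. `ToyDocument`, its `chunks` and `matches` fields, and the `depth` helper are hypothetical names chosen for illustration; this is not Jina's own Document class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document; not Jina's actual class."""
    text: str
    chunks: List["ToyDocument"] = field(default_factory=list)   # finer-grained sub-documents
    matches: List["ToyDocument"] = field(default_factory=list)  # nearest neighbors


def depth(doc: ToyDocument, attr: str) -> int:
    """Count how many levels of `attr` ('chunks' or 'matches') hang below `doc`."""
    children = getattr(doc, attr)
    return 0 if not children else 1 + max(depth(c, attr) for c in children)


# First direction: paragraph -> sentences -> words.
sentence = ToyDocument("Jina is fun", chunks=[ToyDocument(w) for w in "Jina is fun".split()])
paragraph = ToyDocument("Jina is fun. Search is easy.",
                        chunks=[sentence, ToyDocument("Search is easy")])

# Second direction: a match can carry its own matches.
neighbor = ToyDocument("Jina is neat", matches=[ToyDocument("Jina is cool")])
paragraph.matches.append(neighbor)

print(depth(paragraph, "chunks"), depth(paragraph, "matches"))
```

Both directions use the same recursive type, which is why one helper can walk either of them.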
You can think of a Document as a chocolate bar: Different Documents have different formats or ingredients, and you can break them into smaller chunks.
Flows manage the two main search workflows: indexing data (indexing Flow) and then querying it (querying Flow).
A Flow is made up of several sub-tasks, and it manages the states and context of these sub-tasks. The input and output data of Flows are Documents. A Flow is a high-level concept in Jina, representing a sequence of steps for accomplishing a task.
You can add logic to the Flow with the `add` method and create parallel branches with the `needs` method. Once the Flow is built, you can open it like a file in Python and then feed data into it.
Jina Flow is fully decentralized and can be fully distributed on the cloud. You can distribute part of a Flow simply by setting the host to a remote address. You can also containerize a Flow either partially or completely. Besides building a Flow from Python, you can also build one from a YAML config. This separates the code base from the configuration, which can be extremely useful when running tests.
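The build, open, and feed pattern described above can be sketched in plain Python. `ToyFlow`, its `run` method, and the `needs=` keyword are assumptions made for this illustration and should not be read as Jina's actual Flow API. The sketch also runs its steps one after another, whereas a real Flow could run independent branches in parallel.

```python
class ToyFlow:
    """Hypothetical mini-Flow for illustration; Jina's real Flow does far more."""

    def __init__(self):
        self.steps = {}       # step name -> (function, names of the steps it waits for)
        self.last = None      # most recently added step
        self.is_open = False

    def add(self, name, fn, needs=None):
        # By default a step consumes the output of the previously added step.
        if needs is None:
            needs = [self.last] if self.last else []
        self.steps[name] = (fn, list(needs))
        self.last = name
        return self  # returning self allows chaining

    def __enter__(self):
        self.is_open = True
        return self

    def __exit__(self, *exc):
        self.is_open = False

    def run(self, doc):
        assert self.is_open, "open the Flow with a `with` block first"
        results = {}
        for name, (fn, needs) in self.steps.items():  # dicts keep insertion order
            inputs = [results[n] for n in needs] or [doc]
            results[name] = fn(*inputs)
        return results[self.last]


flow = (
    ToyFlow()
    .add("lower", str.lower)
    .add("words", str.split, needs=["lower"])   # branch 1
    .add("length", len, needs=["lower"])        # branch 2, independent of branch 1
    .add("join", lambda words, n: (words, n), needs=["words", "length"])
)

with flow as f:  # opened like a file
    output = f.run("Hello Jina")

print(output)
```

The two middle steps depend only on `lower`, so they form the kind of independent branches that can be parallelized, and `join` waits for both.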
Executors perform each task in an indexing or querying Flow. Jina’s algorithmic logical units are organized into subclasses: Segmenter, Ranker, Encoder, Crafter, Classifier, Indexer and Evaluator. They all originate from the same parent, the Executor. To introduce a new algorithm to Jina, you can simply create a new Executor class by inheriting from an existing Executor and focus on writing the algorithm itself. Given the input, you define your own logic to generate the expected output, just as you would when writing standard NumPy or TensorFlow code.
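As a rough picture of that inheritance idea, the sketch below uses hypothetical classes (`ToyExecutor`, `ToyCrafter`, `LowerCaseCrafter`). They only mimic the shape of the hierarchy and are not Jina's real base classes.

```python
class ToyExecutor:
    """Hypothetical common parent; stands in for the Executor base class."""

    def __call__(self, data):
        raise NotImplementedError("subclasses implement the algorithm")


class ToyCrafter(ToyExecutor):
    """Hypothetical family of Executors that pre-process raw input."""


class LowerCaseCrafter(ToyCrafter):
    def __call__(self, text: str) -> str:
        # The only code the algorithm author writes: input in, output out.
        return text.strip().lower()


crafter = LowerCaseCrafter()
crafted = crafter("  Neural SEARCH ")
print(crafted)
```

The subclass contains nothing but the algorithm. Everything about networking and message handling is left to other layers.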
Jina offers many Executors, which can be divided as follows:
Crafters pre-process input Documents, for example, resizing images or converting text to lower case. A Crafter often comes before the Encoder but it’s not always required.
Like a Crafter, a Segmenter also pre-processes Documents. A Segmenter breaks Documents into multiple chunks. For example, breaking a paragraph into sentences.
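The paragraph-to-sentences example could look roughly like this in plain Python. `segment_sentences` is a hypothetical helper, and splitting on punctuation with a regular expression is a deliberately naive choice for illustration, not the method any particular Jina Segmenter uses.

```python
import re


def segment_sentences(paragraph: str):
    """Split on whitespace that follows '.', '!' or '?', dropping empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


chunks = segment_sentences("Jina is a search framework. It is cloud native! Try it?")
print(chunks)
```

Each returned string would become one chunk of the original Document.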
After Documents are encoded, an Indexer:
- Saves Documents’ vector embeddings and metadata key-value pairs to storage (during indexing).
- Retrieves the vector embeddings and key-value pairs from storage (during querying).
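The two responsibilities listed above can be sketched with an in-memory toy. `ToyIndexer` and its `add` and `query` methods are hypothetical, and ranking by cosine similarity is an assumption for this example. Real Indexers persist to storage and use far more efficient lookups.

```python
import math


class ToyIndexer:
    """Hypothetical in-memory index: vectors and key-value metadata side by side."""

    def __init__(self):
        self.vectors = {}  # doc id -> embedding
        self.meta = {}     # doc id -> metadata key-value pairs

    def add(self, doc_id, vector, **metadata):
        # Indexing time: save the embedding and its metadata.
        self.vectors[doc_id] = vector
        self.meta[doc_id] = metadata

    def query(self, vector, top_k=1):
        # Querying time: rank stored vectors by cosine similarity to the query.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(self.vectors,
                        key=lambda i: cosine(vector, self.vectors[i]),
                        reverse=True)
        return [(i, self.meta[i]) for i in ranked[:top_k]]


index = ToyIndexer()
index.add("d1", [1.0, 0.0], title="cats")
index.add("d2", [0.0, 1.0], title="dogs")
hits = index.query([0.9, 0.1], top_k=1)
print(hits)
```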
Classifiers classify input Documents into categories and output the predicted hard/soft labels. Classifiers are optional, but may be useful depending on the use case.
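To illustrate the difference between soft and hard labels, here is a small sketch. The function names are hypothetical, and softmax is simply one common way to turn raw scores into probabilities. It is used here as an assumption, not as the method Jina's Classifiers necessarily apply.

```python
import math


def soft_labels(scores):
    """Softmax: turn raw per-class scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def hard_label(probabilities):
    """Pick the single most probable category."""
    return max(probabilities, key=probabilities.get)


probs = soft_labels({"cat": 2.0, "dog": 0.5, "bird": -1.0})
label = hard_label(probs)
print(probs, label)
```

A soft label keeps the full distribution over categories, while a hard label commits to one category.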
A Pea wraps an Executor and lets it exchange data with other Peas. Peas can run locally, remotely, or inside a Docker container, containing all dependencies and context in one place.
Every Pea runs inside a Pod. Sometimes multiple copies of a Pea run in a single Pod to improve efficiency and scaling.
A Pod is a cloud native container for the algorithm and interface for one or multiple Peas that have the same properties. It coordinates Peas to improve efficiency and scaling. Beyond that, a Pod adds further control, scheduling, and context management to its Peas.
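The relationship between a Pod and its Peas can be pictured with a small sketch. `ToyPea`, `ToyPod`, the `replicas` parameter, and the round-robin scheduling are all assumptions for illustration. Threads stand in for what would be separate processes or containers in practice.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyPea:
    """Hypothetical Pea: wraps one Executor-like callable."""

    def __init__(self, executor):
        self.executor = executor

    def handle(self, doc):
        return self.executor(doc)


class ToyPod:
    """Hypothetical Pod: owns identical Peas and spreads work across them."""

    def __init__(self, executor, replicas=1):
        self.peas = [ToyPea(executor) for _ in range(replicas)]

    def process(self, docs):
        with ThreadPoolExecutor(max_workers=len(self.peas)) as pool:
            # Round-robin: document i goes to Pea i modulo the number of Peas.
            futures = [pool.submit(self.peas[i % len(self.peas)].handle, d)
                       for i, d in enumerate(docs)]
            return [f.result() for f in futures]  # same order as the input


pod = ToyPod(str.upper, replicas=3)
results = pod.process(["a", "b", "c", "d"])
print(results)
```

The caller only talks to the Pod. How many Peas sit behind it is a deployment detail.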
A Pod is also the basic unit in the Flow, so when you call the Flow’s `add` method you are basically adding Pods to the Flow. A Pod lets you customize its behavior on the cloud: it offers features such as scaling, smart routing, decentralization, parallelization and containerization. When you visualize a Flow in a Jupyter notebook or Google Colab, each block in the graph is basically a part of the Executor (the algorithm unit in Jina). The Executor is the main interface that machine learning engineers and researchers work with in Jina. Jina has hundreds of classic and state-of-the-art Executors covering pre- and post-processing, indexing, ranking and encoding.
A Driver “translates” input and output messages for an Executor. Each Executor requires a different data format to perform its task. Therefore, a Driver interprets incoming messages into Documents and extracts the required fields for an Executor. It is thus a translation layer between Jina’s data types and Python or NumPy data types. It makes the Executor agnostic to data type and network. Without the Driver, you would need to convert, pass, and extract data back and forth between the general data type and the NumPy data type yourself.
You can define the Executor’s behavior under a network request using Drivers. In other words: when receiving request x, call the Executor and do y. Your algorithm can therefore handle the request directly, which is exactly the purpose of the Driver. As an algorithm developer, the Driver is invisible to you and you don’t have to worry about it. In general, the Executor and the Driver work together and serve as a logic layer in Jina.
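The "on request x, call the Executor and do y" idea can be sketched as follows. The message layout (a dict with `request` and `docs` keys), `ToyEncoder`, and `ToyEncodeDriver` are hypothetical and only illustrate the translation role. Jina's real messages and Drivers look different.

```python
class ToyEncoder:
    """Hypothetical Executor: knows nothing about messages, only about strings."""

    def encode(self, texts):
        # A fake two-dimensional "embedding": character count and word count.
        return [[float(len(t)), float(t.count(" ") + 1)] for t in texts]


class ToyEncodeDriver:
    """Hypothetical Driver: on an 'index' request, call the Executor's encode."""

    def __init__(self, executor):
        self.executor = executor

    def __call__(self, message):
        if message["request"] != "index":
            return message  # not our request type: pass it through untouched
        docs = message["docs"]
        texts = [d["text"] for d in docs]          # extract the field the Executor needs
        embeddings = self.executor.encode(texts)   # run the algorithm on plain Python data
        for doc, emb in zip(docs, embeddings):     # write the results back into the message
            doc["embedding"] = emb
        return message


driver = ToyEncodeDriver(ToyEncoder())
msg = driver({"request": "index", "docs": [{"text": "hello jina"}]})
print(msg["docs"][0]["embedding"])
```

The encoder never sees the message structure, which is what keeps it agnostic to data type and network.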
Jina as a framework supports abstractions at different layers and exposes them as APIs to users and developers. High-level APIs like the Flow are public, while other intermediate or low-level APIs, for instance at the Driver level, are hidden intentionally so developers can stay focused on developing their algorithm.
The app is by far the highest-level concept in Jina. It represents a new research project that delivers an end-to-end user experience. For instance, the three Jina Hello World demos can be called three Jina apps, as they deliver the full user journey from indexing to searching.
A typical Jina app project contains two types of files: Python code and YAML config. The Python file defines the entry point and any customized logic, and the YAML config defines the composition of a Flow and the configuration of each Executor.
As a framework, Jina is designed to be universal. Whether the research problem is image-to-image search, semantic text search or question answering, Jina can handle it regardless of the media type. One of the major benefits of using Jina is the time it saves. Jina provides a natural and straightforward design pattern for building your search solutions on the cloud, which otherwise could take months. With Jina you keep end-to-end stack ownership of your search solution and avoid the hassles of fragmented, multi-vendor, generic legacy tools. Unlike other deep learning frameworks, which are designed to be local, Jina is designed to be distributed on the cloud, so features like containerization, distribution, sharding, asynchronous architecture, REST, gRPC and WebSocket work out of the box. Finally, Jina builds many state-of-the-art AI models that are easily usable and extendable with a Pythonic interface.
Jina is a happy family. You can feel the harmony when you use Jina.
You can design at the micro-level and scale up to the macro-level. YAML becomes algorithms, Pods become Flows. The patterns and logic always remain the same. This is the beauty of Jina.
Now, continue to Jina 102 to learn how these components work together!