Guidelines for Adding a New Executor

New deep learning model? New indexing algorithm? When the existing executors/drivers do not fit your requirements and you cannot find a suitable one on Jina Hub, you can extend Jina to do what you need without touching the Jina codebase.

In this chapter, we walk through the guidelines for writing an extension of jina.executors.BaseExecutor. Generally speaking, the steps are the following:

  1. Decide which Executor class to inherit from;

  2. Override __init__() and post_init();

  3. Override the core method of the base class;

  4. (Optional) implement the persistence logic.

Decide which Executor class to inherit from

The list of executors supported by the current version of Jina can be found here. As you can see, all executors inherit from jina.executors.BaseExecutor. So should your extension inherit directly from BaseExecutor as well? In general, no. As a rule of thumb, inherit from the executor whose logic is closest to yours.

If your algorithm is so unique that it does not fit any of the categories below, you may want to submit an issue for discussion before you start.

Note

Inherit from class X when …

  • jina.executors.encoders.BaseEncoder

    You want to represent the chunks as vector embeddings.

  • jina.executors.indexers.BaseIndexer

    You want to save and retrieve vectors and key-value information from storage.

  • jina.executors.crafters.BaseCrafter

    You want to segment/transform the documents and chunks.

    • jina.executors.crafters.BaseDocCrafter

      You want to transform the documents by modifying some fields.

      • jina.executors.crafters.BaseChunkCrafter

        You want to transform the chunks by modifying some fields.

      • jina.executors.crafters.BaseSegmenter

        You want to segment the documents into chunks.

  • jina.executors.Chunk2DocRanker

    You want to score and rank documents based on their matched chunks.

  • jina.executors.CompoundExecutor

    You want to combine multiple executors in one.

Override __init__() and post_init()

Override __init__()

You can put simple-type attributes that define the behavior of your Executor into __init__(). Simple types here means picklable types, including: integer, bool, string, tuple of simple types, list of simple types, and map of simple types. For example,

from typing import Optional

from jina.executors.crafters import BaseSegmenter


class GifPreprocessor(BaseSegmenter):
    def __init__(self, img_shape: int = 96, every_k_frame: int = 1,
                 max_frame: Optional[int] = None, from_bytes: bool = False,
                 *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Simple-type attributes only, so they can be persisted with the executor.
        self.img_shape = img_shape
        self.every_k_frame = every_k_frame
        self.max_frame = max_frame
        self.from_bytes = from_bytes

Remember to call super().__init__(*args, **kwargs) in your __init__(). Only then do you get the many features provided by the base class (and BaseExecutor), e.g. YAML support and persistence.

Note

All attributes declared in __init__() will be persisted during save() and load().
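
To see why simple types matter, here is a minimal sketch that uses only the standard library's pickle module. GifSettings, save_state() and load_state() are hypothetical names introduced for illustration; they are not Jina classes or functions, and Jina's own save() and load() are not called here. The sketch only assumes that persistence is pickle-based, as described above.

```python
import pickle


class GifSettings:
    """Hypothetical stand-in for an executor; not part of Jina."""

    def __init__(self, img_shape=96, every_k_frame=1, max_frame=None, from_bytes=False):
        # Only simple, picklable values are stored as attributes.
        self.img_shape = img_shape
        self.every_k_frame = every_k_frame
        self.max_frame = max_frame
        self.from_bytes = from_bytes


def save_state(obj) -> bytes:
    """Serialize the attribute dictionary of an object."""
    return pickle.dumps(vars(obj))


def load_state(cls, blob: bytes):
    """Rebuild an object from saved attributes without calling __init__()."""
    obj = cls.__new__(cls)
    obj.__dict__.update(pickle.loads(blob))
    return obj


original = GifSettings(img_shape=64, every_k_frame=2)
restored = load_state(GifSettings, save_state(original))
```

After the round trip, restored carries the same attribute values as original, which is the behavior the note describes for attributes set in __init__().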

Override post_init()

So what if the data you need to load is not of a simple type? For example, a deep learning graph, a big pretrained model, a gRPC stub, a TensorFlow session, or a thread? Then you can put it into post_init().

Another scenario is when you know there is a better persistence method than pickle. For example, a hyperparameter matrix stored as a numpy ndarray is certainly picklable. However, you can simply read and write it via standard file IO, which is likely more efficient than pickle. In this case, do the data loading in post_init().

Here is a good example.

from jina.executors.encoders import BaseTextEncoder

class TextPaddlehubEncoder(BaseTextEncoder):

    def __init__(self,
                 model_name: str = 'ernie_tiny',
                 max_length: int = 128,
                 *args,
                 **kwargs):
        super().__init__(*args, **kwargs)
        self.model_name = model_name
        self.max_length = max_length

    def post_init(self):
        import paddlehub as hub
        self.model = hub.Module(name=self.model_name)
        self.model.MAX_SEQ_LEN = self.max_length
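
The same split can be tried without PaddleHub installed. The sketch below is a stand-in, not Jina code: LockingExecutor is a hypothetical class, and it calls post_init() by hand from __init__() and __setstate__(), which is an assumption made only to keep the example self-contained. A threading.Lock plays the role of the resource that cannot be pickled.

```python
import pickle
import threading


class LockingExecutor:
    """Hypothetical stand-in that imitates the __init__()/post_init() split."""

    def __init__(self, name="demo"):
        self.name = name          # simple type: safe to persist
        self.post_init()          # assumption: called by hand in this sketch

    def post_init(self):
        # A lock cannot be pickled, so it is created here rather than persisted.
        self.lock = threading.Lock()

    def __getstate__(self):
        # Persist only the simple attributes and drop the lock.
        state = self.__dict__.copy()
        state.pop("lock", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.post_init()          # rebuild the non-picklable part after loading


executor = LockingExecutor(name="demo")

# Pickling the lock directly fails with a TypeError.
try:
    pickle.dumps(threading.Lock())
    lock_is_picklable = True
except TypeError:
    lock_is_picklable = False

# Persist the simple state, then rebuild a fresh object from it.
blob = pickle.dumps(executor.__getstate__())
restored = LockingExecutor.__new__(LockingExecutor)
restored.__setstate__(pickle.loads(blob))
```

The restored object keeps its name and receives a new lock from post_init(), so nothing non-picklable ever needs to be written to disk.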

Note

post_init() is also a good place to introduce package dependencies, e.g. import x or from x import y. Naively, one can always put all imports at the top of the file. However, this raises a ModuleNotFoundError when the package is not installed locally. Sometimes this one missing dependency can break the whole system.

As a rule of thumb, only import packages where you really need them. Often these dependencies are only required in post_init() and the core method, which we shall see later.
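
A minimal sketch of this deferred-import pattern follows. LazyImportEncoder is a hypothetical stand-in rather than a Jina class, and importlib.import_module() is used in place of a plain import statement only so that the module name can be passed as a parameter; the package name package_that_is_not_installed_xyz is made up to trigger the failure.

```python
import importlib


class LazyImportEncoder:
    """Hypothetical stand-in; not part of Jina."""

    def __init__(self, backend="json"):
        self.backend = backend    # only the module *name* is stored here
        self.module = None

    def post_init(self):
        # The dependency is imported only when the executor is initialized.
        self.module = importlib.import_module(self.backend)


# An installed dependency loads as usual.
ok = LazyImportEncoder(backend="json")
ok.post_init()

# Constructing an executor with a missing dependency does not fail ...
missing = LazyImportEncoder(backend="package_that_is_not_installed_xyz")

# ... the error appears only when post_init() actually needs the package.
try:
    missing.post_init()
    error = None
except ModuleNotFoundError as exc:
    error = exc
```

Because the import lives in post_init(), merely loading the file that defines the class does not require the optional package.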

Override the core method of the base class

Each Executor has a core method, which defines the algorithmic behavior of the Executor. To make your own extension, you have to override the core method. The following table lists the core methods you may want to override. Note that some executors have multiple core methods.

Base class      Core method(s)

BaseEncoder     encode()

BaseCrafter     craft()

BaseIndexer     add(), query()

BaseRanker      score()

Feel free to override other methods/properties as you need. But frankly, most extensions can be done by simply overriding the core methods listed above. Nothing more. You can read the source code of our executors for details.
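
As an illustration of overriding a core method, here is a toy sketch. StandInEncoder is a hypothetical placeholder for a base encoder, not Jina's BaseEncoder, and the encode() signature shown (a list of strings in, a list of number lists out) is an assumption chosen to keep the example dependency-free; check the real base class for its actual signature.

```python
class StandInEncoder:
    """Hypothetical placeholder for a base encoder; not Jina's BaseEncoder."""

    def encode(self, data):
        raise NotImplementedError


class CharCountEncoder(StandInEncoder):
    """Toy encoder: maps each string to [length, number of vowels]."""

    def encode(self, data):
        # Overriding the core method is the only change this extension makes.
        return [
            [len(text), sum(ch in "aeiou" for ch in text.lower())]
            for text in data
        ]


embeddings = CharCountEncoder().encode(["jina", "executor"])
```

Everything else (construction, configuration, persistence) would come from the base class, which is why overriding the core method is usually enough.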

Implement the persistence logic

If you don’t override post_init(), then you don’t need to implement persistence logic. You get YAML and persistence support out of the box from BaseExecutor. Simple crafters and rankers fall into this category.

If you override post_init() but you don’t care about persisting its state for the next run (when the executor process is restarted), or the state simply does not change during the run, then you don’t need to implement persistence logic. Loading from a fixed pretrained deep learning model falls into this category.

Persistence logic is only required when you implement customized loading logic in post_init() and the state changes during the run. Then you need to override __getstate__(). Many of the indexers fall into this category.

In the example below, the tokenizer is loaded in post_init() and saved in __getstate__(), which completes the persistence cycle.

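from jina.executors.encoders import BaseEncoder


class CustomizedEncoder(BaseEncoder):

    def post_init(self):
        # tokenizer_dict, self.model_name and self._tmp_model_path are defined
        # elsewhere in the full encoder; they are not shown in this excerpt.
        self.tokenizer = tokenizer_dict[self.model_name].from_pretrained(self._tmp_model_path)
        self.tokenizer.padding_side = 'right'

    def __getstate__(self):
        # Write the tokenizer to self.model_abspath before returning the state.
        self.tokenizer.save_pretrained(self.model_abspath)
        return super().__getstate__()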
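
The same load-and-save cycle can be sketched with the standard library alone. TinyIndexer is a hypothetical stand-in for an indexer, not a Jina class; it calls post_init() by hand and stores its index as JSON, both of which are assumptions made for illustration. Calling __getstate__() directly stands in for whatever a real save would trigger.

```python
import json
import os
import tempfile


class TinyIndexer:
    """Hypothetical stand-in for an indexer; not part of Jina."""

    def __init__(self, index_path):
        self.index_path = index_path   # simple type, persisted as usual
        self.post_init()               # assumption: called by hand in this sketch

    def post_init(self):
        # Customized loading: read the index if an earlier run saved it.
        self.index = {}
        if os.path.exists(self.index_path):
            with open(self.index_path) as fp:
                self.index = json.load(fp)

    def add(self, key, value):
        self.index[key] = value        # the state changes during the run

    def __getstate__(self):
        # Customized saving: write the index with plain file IO,
        # then leave it out of the state that would be pickled.
        with open(self.index_path, "w") as fp:
            json.dump(self.index, fp)
        state = self.__dict__.copy()
        state.pop("index")
        return state


path = os.path.join(tempfile.mkdtemp(), "index.json")

first_run = TinyIndexer(path)
first_run.add("doc1", [0.1, 0.2])
saved_state = first_run.__getstate__()   # stands in for a save

second_run = TinyIndexer(path)           # simulates a restarted process
```

The second instance finds the file written by the first one, so the data added during the earlier run survives the restart.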

How Can I Use My Extension

You can use the extension by specifying py_modules in the YAML file. For example, suppose your extension Python file is called my_encoder.py and defines MyEncoder. Then you can define a YAML file (say my.yml) as follows:

!MyEncoder
with:
  greetings: hello im external encoder
metas:
  py_modules: my_encoder.py

Note

You can also assign a list of files to metas.py_modules if your Python logic is split over multiple files. This YAML file and all Python extension files should be put in the same directory.

Then simply use it in the Jina CLI by specifying jina pod --yaml-path=my.yml, or via Flow().add(yaml_path='my.yml') in the Flow API.

Warning

If you use a customized executor inside a jina.executors.CompoundExecutor, then you only need to set metas.py_modules at the root level, not at the sub-component level.

I Want to Contribute it to Jina

We are really glad to hear that! We have put considerable effort into helping you contribute and share your extensions with others.

You can easily pack your extension and share it with others via a Docker image. For more information, please check out Jina Hub. Just make a pull request there and our CI/CD system will take care of building, testing and uploading.