Guidelines for Adding a New Executor¶
New deep learning model? New indexing algorithm? When the existing executors/drivers do not fit your requirements and you cannot find a suitable one on Jina Hub, you can extend Jina to do what you need without touching the Jina codebase.
This chapter walks through the guidelines for writing an extension of jina.executors.BaseExecutor. Generally speaking, the steps are the following:
Decide which Executor class to inherit from;
Override __init__() and post_init();
Override the core method of the base class;
(Optional) implement the save logic.
Decide which Executor class to inherit from¶
The list of executors supported by the current version of Jina can be found here. As you can see, all executors inherit from jina.executors.BaseExecutor. So should your extension inherit directly from BaseExecutor as well? In general, no. As a rule of thumb, inherit from the executor whose logic is closest to yours.
If your algorithm is so unique that it does not fit any of the categories below, you may want to submit an issue for discussion before you start.
Note
Inherit from class X | when … |
jina.executors.encoders.BaseEncoder | You want to represent the chunks as vector embeddings. |
jina.executors.encoders.BaseNumericEncoder | You want to represent numpy array objects (e.g. image, video, audio) as vector embeddings. |
jina.executors.encoders.BaseTextEncoder | You want to represent string objects as vector embeddings. |
jina.executors.indexers.BaseIndexer | You want to save and retrieve vectors and key-value information from storage. |
jina.executors.indexers.BaseVectorIndexer | You want to save and retrieve vectors from storage. |
jina.executors.indexers.NumpyIndexer | Your vector indexer uses a simple numpy array for storage, and you only want to specify the query logic. |
jina.executors.indexers.BaseKVIndexer | You want to save and retrieve key-value pairs from storage. |
jina.executors.crafters.BaseCrafter | You want to segment/transform the documents and chunks. |
jina.executors.crafters.BaseDocCrafter | You want to transform the documents by modifying some fields. |
jina.executors.crafters.BaseChunkCrafter | You want to transform the chunks by modifying some fields. |
jina.executors.crafters.BaseSegmenter | You want to segment the documents into chunks. |
jina.executors.Chunk2DocRanker | You want to rank the documents based on the scores of their chunks. |
jina.executors.CompoundExecutor | You want to combine multiple executors in one. |
jina.executors.BaseClassifier | You want to enrich the documents and chunks with a classifier. |
Override __init__() and post_init()¶
Override __init__()¶
You can put simple-type attributes that define the behavior of your Executor into __init__(). Simple types are all pickle-able types, including: integer, bool, string, tuple of simple types, list of simple types, and dict of simple types. For example,
from jina.executors.crafters import BaseSegmenter


class GifPreprocessor(BaseSegmenter):
    def __init__(self, img_shape: int = 96, every_k_frame: int = 1,
                 max_frame: int = None, from_bytes: bool = False,
                 *args, **kwargs):
        super().__init__(*args, **kwargs)
        # simple-type attributes that configure the executor
        self.img_shape = img_shape
        self.every_k_frame = every_k_frame
        self.max_frame = max_frame
        self.from_bytes = from_bytes
Remember to add super().__init__(*args, **kwargs) to your __init__(). Only then do you get the features provided by the base class (and BaseExecutor), e.g. YAML support and persistence.
Note
All attributes declared in __init__() will be persisted during save() and load().
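The effect described in the note can be illustrated without Jina. The sketch below uses a hypothetical stand-in class, ToyExecutor, and plain pickle on the attribute dictionary to show why simple-type attributes set in __init__() survive a save/load round trip. It is an illustration under those assumptions, not Jina's actual save() and load() implementation.

```python
import pickle


class ToyExecutor:
    """Hypothetical stand-in for an executor; not part of Jina."""

    def __init__(self, img_shape: int = 96, every_k_frame: int = 1):
        # Only simple, picklable attributes are set here.
        self.img_shape = img_shape
        self.every_k_frame = every_k_frame


original = ToyExecutor(img_shape=64, every_k_frame=2)

# "Save": serialize the attribute dictionary to bytes.
saved_bytes = pickle.dumps(original.__dict__)

# "Load": create an empty instance and restore the attributes into it.
restored = ToyExecutor.__new__(ToyExecutor)
restored.__dict__.update(pickle.loads(saved_bytes))
```

Anything that cannot be pickled (a model handle, a session, a thread) would break the "save" step here, which is why such objects belong in post_init() instead.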
Override post_init()¶
So what if the data you need to load is not of a simple type, for example a deep learning graph, a big pretrained model, a gRPC stub, a TensorFlow session, or a thread? Then you can put it into post_init().
Another scenario is when you know of a better persistence method than pickle. For example, a hyperparameter matrix stored as a numpy ndarray is certainly picklable. However, you can simply read and write it via standard file IO, which is likely more efficient than pickle. In this case, do the data loading in post_init().
Here is a good example.
from jina.executors.encoders import BaseTextEncoder


class TextPaddlehubEncoder(BaseTextEncoder):
    def __init__(self,
                 model_name: str = 'ernie_tiny',
                 max_length: int = 128,
                 *args,
                 **kwargs):
        super().__init__(*args, **kwargs)
        self.model_name = model_name
        self.max_length = max_length

    def post_init(self):
        # heavy, non-picklable state is loaded here, not in __init__()
        import paddlehub as hub
        self.model = hub.Module(name=self.model_name)
        self.model.MAX_SEQ_LEN = self.max_length
Note
post_init() is also a good place to introduce package dependencies, e.g. import x or from x import y. Naively, one could put all imports at the top of the file. However, this raises a ModuleNotFoundError when the package is not installed locally, and that one missing dependency may break the whole system.
As a rule of thumb, only import packages where you really need them. Often these dependencies are only required in post_init() and the core method, which we shall see later.
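A minimal sketch of this deferred-import pattern follows. LazyImportExecutor is a hypothetical stand-in class, not part of Jina, and the package name a_package_that_is_not_installed is made up to simulate a missing dependency. The import is done through importlib only so that the package name can be passed in as a parameter; a plain import statement inside post_init() behaves the same way.

```python
class LazyImportExecutor:
    """Hypothetical stand-in for an executor; not part of Jina."""

    def __init__(self, package_name: str = 'json'):
        # Store only the name of the dependency; nothing is imported yet.
        self.package_name = package_name
        self.backend = None

    def post_init(self):
        # The import happens here, so a missing package only fails
        # when this particular executor is initialized.
        import importlib
        self.backend = importlib.import_module(self.package_name)


ok = LazyImportExecutor('json')
ok.post_init()  # 'json' ships with Python, so this succeeds

# Constructing the object is harmless even though the package is missing.
broken = LazyImportExecutor('a_package_that_is_not_installed')
try:
    broken.post_init()
    import_failed = False
except ModuleNotFoundError:
    import_failed = True
```

With a top-level import, merely loading the module that defines the class would already fail. Here the failure is confined to the one executor that needs the missing package.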
Override the core method of the base class¶
Each Executor has a core method, which defines the algorithmic behavior of the Executor. To make your own extension, you have to override the core method. The following table lists the core methods you may want to override. Note that some executors have multiple core methods.
Base class | Core method(s) |
Feel free to override other methods/properties as you need. In practice, though, most extensions can be written by simply overriding the core methods listed above and nothing more. You can read the source code of our executors for details.
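As an illustration of overriding a core method, here is a minimal sketch that does not depend on Jina. ToyBaseEncoder and CharCountEncoder are hypothetical stand-ins, and the method name encode() and its signature are assumptions made for this example; check the base class you actually inherit from for the real method name and arguments.

```python
from typing import List


class ToyBaseEncoder:
    """Hypothetical stand-in for an encoder base class; not part of Jina."""

    def encode(self, data: List[str]) -> List[List[float]]:
        # The base class only declares the core method.
        raise NotImplementedError


class CharCountEncoder(ToyBaseEncoder):
    """Encodes each string as the vector [length, number of vowels]."""

    def encode(self, data: List[str]) -> List[List[float]]:
        # One fixed-size vector per input string.
        return [
            [float(len(text)), float(sum(ch in 'aeiou' for ch in text.lower()))]
            for text in data
        ]


embeddings = CharCountEncoder().encode(['hello', 'Jina'])
```

The subclass adds no other methods: all of its behavior lives in the single overridden core method.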
Implement the persistence logic¶
If you don’t override post_init(), then you don’t need to implement persistence logic. You get YAML and persistence support out of the box from BaseExecutor. Simple crafters and rankers fall into this category.
If you override post_init() but don’t care about persisting its state for the next run (when the executor process is restarted), or the state simply does not change during the run, then you don’t need to implement persistence logic either. Loading a fixed pretrained deep learning model falls into this category.
Persistence logic is only required when you implement customized loading logic in post_init() and the state changes during the run. In that case you need to override __getstate__(). Many of the indexers fall into this category.
In the example below, the tokenizer is loaded in post_init() and saved in __getstate__(), which completes the persistence cycle.
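from jina.executors.encoders import BaseEncoder


class CustomizedEncoder(BaseEncoder):
    def post_init(self):
        # tokenizer_dict is assumed to map model names to tokenizer classes (defined elsewhere)
        self.tokenizer = tokenizer_dict[self.model_name].from_pretrained(self._tmp_model_path)
        self.tokenizer.padding_side = 'right'

    def __getstate__(self):
        # write the tokenizer to disk before the executor itself is pickled
        self.tokenizer.save_pretrained(self.model_abspath)
        return super().__getstate__()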
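The same load/save cycle can be sketched in a form that runs without Jina. ToyIndexer is a hypothetical stand-in class, and its JSON side file is an arbitrary choice made for this illustration. The stand-in calls post_init() explicitly from __init__() and invokes __getstate__() by hand to simulate a save.

```python
import json
import os
import tempfile


class ToyIndexer:
    """Hypothetical stand-in for an executor with custom persistence."""

    def __init__(self, store_path: str):
        self.store_path = store_path  # simple type, safe to pickle
        self.post_init()              # explicit call in this stand-in

    def post_init(self):
        # Custom loading logic: restore state from the side file if present.
        if os.path.exists(self.store_path):
            with open(self.store_path) as fp:
                self.entries = json.load(fp)
        else:
            self.entries = {}

    def __getstate__(self):
        # Custom saving logic: write the changed state to the side file,
        # then leave it out of the attributes that would be pickled.
        with open(self.store_path, 'w') as fp:
            json.dump(self.entries, fp)
        state = self.__dict__.copy()
        del state['entries']
        return state


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'entries.json')

    first_run = ToyIndexer(path)
    first_run.entries['doc1'] = 'hello'     # state changes during the run
    saved_state = first_run.__getstate__()  # what a pickle-based save triggers

    # Simulate a restart: rebuild from the simple attributes only.
    second_run = ToyIndexer(saved_state['store_path'])
    restored_entries = second_run.entries
```

Without the write in __getstate__(), the second instance would start with an empty dictionary, because post_init() would find nothing to load.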
How Can I Use My Extension¶
You can use the extension by specifying py_modules in the YAML file. For example, suppose your extension Python file is called my_encoder.py and defines MyEncoder. Then you can define a YAML file (say my.yml) as follows:
!MyEncoder
with:
  greetings: hello im external encoder
metas:
  py_modules: my_encoder.py
Note
You can also assign a list of files to metas.py_modules if your Python logic is split over multiple files. This YAML file and all Python extension files should be placed in the same directory.
Then simply use it in the Jina CLI by specifying jina pod --uses=my.yml, or Flow().add(uses='my.yml') in the Flow API.
Warning
If you use a customized executor inside a jina.executors.CompoundExecutor, then you only need to set metas.py_modules at the root level, not at the sub-component level.
I Want to Contribute it to Jina¶
We are really glad to hear that! We have put considerable effort into helping you contribute and share your extensions with others.
You can easily package your extension and share it with others as a Docker image. For more information, please check out Jina Hub. Just make a pull request there and our CI/CD system will take care of building, testing and uploading.