# Flow Optimization

The `FlowOptimizer` runs Flows with different parameter sets. It enables out-of-the-box hyperparameter tuning inside Jina.
## Flow Optimization is hyperparameter tuning
A common pattern when building a Flow in Jina is:

1. Design the high-level `flow.yml`
2. Define several `pods.yml` files for all needed Executors
3. Repeat until the evaluation metric suits your use case:
   - Change variable values in different Executors (e.g. the used model, the used model layer, details of the segmenter)
   - Index some data
   - Query some data and look at the results
The `FlowOptimizer` automates step 3. It can be used via Python code or via JAML definitions, as is common around Jina.
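To make concrete what is being automated, here is a minimal sketch of that manual loop in plain Python. `build_flow`, `index_data` and `query_and_evaluate` are hypothetical placeholder helpers, not part of the Jina API:

```python
import os

# Hypothetical manual tuning loop that the FlowOptimizer replaces
best_layer, best_score = None, float('inf')
for layer in range(3):
    os.environ['JINA_ENCODER_LAYER'] = str(layer)  # change a variable value
    flow = build_flow('flow.yml')     # hypothetical: rebuild the Flow
    index_data(flow)                  # hypothetical: index some data
    score = query_and_evaluate(flow)  # hypothetical: query and evaluate
    if score < best_score:
        best_layer, best_score = layer, score
```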
## Before you start

Make sure you install Jina with the optimizer extras, e.g. via `pip install "jina[optimizer]"` (see the Installation documentation).

Read the Evaluator entry in the glossary.
## Using the FlowOptimizer

In this toy example, we try to find the optimal layer of an encoder for the final embedding. This is common practice in machine learning: the best semantic representation for a given problem is not necessarily the last layer of a given model.
A Flow Optimization requires the following components:

- At least one Flow and the corresponding Pod definitions via JAML
- An Evaluator Executor in at least one of the Flows
- A data source containing Documents, which are sent to each Flow
- A `parameter.yml` file describing the optimization scenario
- A `FlowRunner` object, which allows repeatedly running the same Flow with different configurations
Let's define a `flow.yml`:

```yaml
!Flow
version: '1'
env:
  JINA_ENCODER_LAYER_VAR: ${{JINA_ENCODER_LAYER}}
pods:
  - uses: encoder.yml
  - uses: EuclideanEvaluator
```
The `FlowOptimizer` will change the value of `JINA_ENCODER_LAYER` later on. The Flow passes it on to `encoder.yml` via the `JINA_ENCODER_LAYER_VAR` variable. The `EuclideanEvaluator` calculates the distance between the computed encoding and the expected one.
Furthermore, we need the corresponding `encoder.yml`:

```yaml
!SimpleEncoder
with:
  layer: ${{JINA_ENCODER_LAYER_VAR}}
```

```python
from typing import Sequence

import numpy as np

from jina.executors.encoders import BaseEncoder


class SimpleEncoder(BaseEncoder):
    # Precomputed "embeddings": one value per query and layer
    ENCODE_LOOKUP = {
        '🐲': [1, 3, 5],
        '🐦': [2, 4, 7],
        '🐢': [0, 2, 5],
    }

    def __init__(self, layer=0, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._column = layer

    def encode(self, data: Sequence[str], *args, **kwargs) -> 'np.ndarray':
        # Look up the precomputed value of the chosen "layer" (column)
        return np.array([[self.ENCODE_LOOKUP[data[0]][self._column]]])
```
The `SimpleEncoder` does not perform any real computation. For illustration purposes, it just picks precomputed values for the different queries; hence the semantic switch from `layer` to `_column`. Choosing a column here is comparable to choosing a layer in a real-world encoder (the second layer for 🐦 would result in the encoding `[4]`).
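A quick check of the encoder in isolation (assuming the class above is defined and Jina is installed):

```python
encoder = SimpleEncoder(layer=1)
print(encoder.encode(['🐦']))  # [[4]]: the second "layer" of 🐦
```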
As the next step we need some ground truth data:

```python
import numpy as np

from jina import Document

documents = [
    (Document(content='🐲'), Document(embedding=np.array([2]))),
    (Document(content='🐦'), Document(embedding=np.array([3]))),
    (Document(content='🐢'), Document(embedding=np.array([3]))),
]
```
Documents will be sent in `(doc, groundtruth)` pairs to the Flow. The `doc` represents a Document that should be encoded; the `groundtruth` contains the ideal encoding. The perfect semantic encoding for 🐲 would be `2`.
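Given the lookup table above, we can sanity-check by hand which column the optimizer should pick, using the mean Euclidean distance to the ground truths (plain NumPy, no Jina needed):

```python
import numpy as np

lookup = {'🐲': [1, 3, 5], '🐦': [2, 4, 7], '🐢': [0, 2, 5]}
targets = {'🐲': 2, '🐦': 3, '🐢': 3}

for layer in range(3):
    # Mean Euclidean distance between 1-d encodings and ground truths
    distances = [abs(lookup[c][layer] - t) for c, t in targets.items()]
    print(layer, np.mean(distances))
# layer 0 -> ~1.67, layer 1 -> 1.0, layer 2 -> 3.0, so layer 1 is optimal
```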
Note: In a real-world example the ground truth would rather be Documents that should be retrieved after querying. For the sake of simplicity, we omitted the indexing step in this example.
The `FlowRunner` wraps the Flow and the Documents for re-runnability. This ensures no side effects between different Flow runs during optimization.

```python
from jina.optimizers.flow_runner import SingleFlowRunner

runner = SingleFlowRunner('flow.yml', documents, 1, 'search', overwrite_workspace=True)
```
Now we need to tell the `FlowOptimizer` what it can optimize: the `JINA_ENCODER_LAYER` variable. This is done via a `parameter.yml` file:

```yaml
- !IntegerParameter
  jaml_variable: JINA_ENCODER_LAYER
  high: 2
  low: 0
  step_size: 1
```

The variable `JINA_ENCODER_LAYER` can take `int` values in the range `[0, 2]`.
Possible choices for variables are:

- `IntegerParameter` and `DiscreteUniformParameter` for `int`-based Python variables (e.g. the layer of a model)
- `UniformParameter` and `LogUniformParameter` for `float`-based Python variables (e.g. a confidence threshold in object detection)
- `CategoricalParameter` for Python variables which can be categorized (e.g. model names)
Under the hood, Jina leverages the optuna optimizer.
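For illustration, other parameter types follow the same pattern in `parameter.yml`. The field layout below for the non-integer types is an assumption based on the `IntegerParameter` example above, so check the API reference for the exact schema:

```yaml
# Assumed layout, mirroring the IntegerParameter example above
- !CategoricalParameter
  jaml_variable: JINA_MODEL_NAME   # hypothetical variable: a model name
  choices: ['model-a', 'model-b']
- !UniformParameter
  jaml_variable: JINA_THRESHOLD    # hypothetical variable: a float threshold
  high: 1.0
  low: 0.0
```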
Finally, we can define the `FlowOptimizer` and run it:

```python
from jina.optimizers import FlowOptimizer, MeanEvaluationCallback

optimizer = FlowOptimizer(
    flow_runner=runner,
    parameter_yaml='parameter.yml',
    evaluation_callback=MeanEvaluationCallback(),
    n_trials=3,
    direction='minimize',
    seed=1,
)
result = optimizer.optimize_flow()
```
The `MeanEvaluationCallback` takes the results of the last Evaluator inside a Flow and averages them. In the Flow defined above, that is the single `EuclideanEvaluator`.
Afterwards, we can write the optimal parameters into a file:

```python
result.save_parameters('result_file.yml')
```

If you are familiar with optuna, you can access more information directly from the optuna study object via `result.study`. For example, `result.study.trials` contains detailed information about all trials.
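For instance, after a finished run you can inspect the standard optuna study attributes:

```python
# Standard optuna study attributes on the underlying study object
print(result.study.best_params)  # the winning parameter values
print(result.study.best_value)   # the best (here: lowest) mean evaluation
for trial in result.study.trials:
    print(trial.number, trial.params, trial.value)
```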
## Limitations

Currently, it is not possible to optimize a Flow that is defined via the Python interface.