Parallelization in Jina¶
Learn how Jina handles parallelization of data processing¶
Feature description and expected outcome¶
Jina allows for parallelization of Peas and Pods. This generally leads to an increase in throughput, as the data is split across processes.
Before you start¶
You are expected to be familiar with the general concepts of Jina, as presented in here.
You should also have experience with parallelism in other systems. Jina does follow the best practices and designs from other databases or search engines. However, Jina is primarily a neural search engine. This means some changes and adaptations were necessary. This chapter will explain these.
We will approach this from a top-down perspective.
Pods are the highest level of abstraction where we can have parallelism. This is achieved by using the
f = (Flow().add(name='p1', needs='gateway') .add(name='p2', needs='gateway') .add(name='p3', needs='gateway') .needs(['p1','p2', 'p3'], name='r1').plot())
The plot of the above topology looks like this:
In the above,
p3 now subscribe to
gateway and conduct their work in parallel. The last
.needs() blocks all Pods until they finish their work.
Peas are the next level at which we can configure parallelism. In this case,
Pod receives the data (in its
HeadPea) and then distributes it to its
Peas. You can achieve this by using the
f = (Flow().add(name='p1', needs='gateway') .add(name='p2', needs='gateway') .add(name='p3', parallel=3) .needs(['p1','p3'], name='r1').plot())
State-less vs state-ful¶
Depending on the type of
parallel means different things. In state-less
Encoders etc.), you parallelize, not shard. This means the data is not persisted, but it’s only processed in a parallel manner.
On the other hand, state-ful
Peas (Indexers) will store the data across shards. This terminology is borrowed from other database systems.
For more on sharding in other DB systems (with a focus on traditional SQL), see [here].(https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781849516723/1/ch01lvl1sec10/using-sharding-and-data-distribution)
For both of these, you need to set
parallel to some number higher than
Index vs Query time Polling¶
any in order to tell the
HeadPea to spread the data across its inner
Peas. This allows us to parallelize. You do this for the Flow you will use to index data.
However, when querying, you need to make sure to use a Flow where the
Pods are using
polling: all. This ensures that the requests are seen by all the
Peas, and that the respective
id is found. If we use
polling: any we risk that the key is not seen by the shard that has the data.
Jina offers two types of scheduling:
round_robinwill assign tasks in circular order, without any prioritization. We assume all
Peashave uniform processing power.
load_balancewill assign tasks to the idle
These are configured via
scheduling, on the
An example of configuring the above parameters can be seen here:
from jina.enums import SchedulerType f = Flow().add(name='p1', parallel=3, scheduling=SchedulerType.ROUND_ROBIN)
At the moment Jina does not support sharding and replication of shards at the same time. We only support sharding (
Peas of the
Pod with their own parts of the data).