(Beta) New DocArray support#
Jina provides early support for DocArray>=0.30, which is a rewrite of DocArray. This new version makes the dataclass feature of DocArray v1 a first-class citizen, and for this purpose it is built on top of Pydantic. An important shift is that the new DocArray adapts to users’ data, whereas DocArray v1 forces users to adapt to the Document schema.
Warning
Beta support: the new DocArray syntax is available in DocArray versions 0.30 and later. Some Jina features have been adapted to the new DocArray version, but not all of them, so support for it should be considered Beta. The plan is to keep compatibility with both sets of versions after the migration is complete.
New DocArray schema#
At the heart of DocArray>=0.30 is a new schema that is more flexible and expressive than the original DocArray schema.
You can refer to the DocArray README for more details. Note that the names of the data structures also change in the new version of DocArray.
On the Jina side, this flexibility extends to every Executor, where you can now customize input and output schemas:
With DocArray<0.30 (the version currently used by default in Jina), a Document has a fixed schema and an Executor performs in-place operations on it.
With DocArray>=0.30, an Executor defines its own input and output schemas. It also provides several predefined schemas that you can use out of the box.
New Executor API#
To reflect the change with DocArray v2, the Executor API now supports schema definition. The design is inspired by FastAPI.
from jina import Executor, requests
from docarray import DocList, BaseDoc
from docarray.documents import ImageDoc
from docarray.typing import AnyTensor
import numpy as np


class InputDoc(BaseDoc):
    img: ImageDoc


class OutputDoc(BaseDoc):
    embedding: AnyTensor


class MyExec(Executor):
    @requests(on='/bar')
    def bar(
        self, docs: DocList[InputDoc], **kwargs
    ) -> DocList[OutputDoc]:
        # Return one zero-filled embedding per input Document
        docs_return = DocList[OutputDoc](
            [OutputDoc(embedding=np.zeros((100, 1))) for _ in range(len(docs))]
        )
        return docs_return
For our Executor we define:
- An input schema InputDoc and an output schema OutputDoc, which are Documents.
- The bar endpoint, which takes a DocList of InputDoc as input and returns a DocList of OutputDoc.
Note that the type hint is actually more than just a hint: the Executor uses it to infer the actual schema of the endpoint.
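As a rough illustration of how a type hint can double as a schema declaration, the sketch below reads a function's annotations with the standard library. It is not Jina's implementation: infer_schemas, and the plain-Python InputDoc and OutputDoc used here, are hypothetical stand-ins chosen only to show the mechanism.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple, get_type_hints


@dataclass
class InputDoc:  # hypothetical stand-in for a BaseDoc subclass
    text: str


@dataclass
class OutputDoc:  # hypothetical stand-in for a BaseDoc subclass
    score: float


def infer_schemas(fn: Callable) -> Tuple[Any, Any]:
    """Return the annotation of the `docs` argument and the return annotation."""
    hints = get_type_hints(fn)
    return hints.get('docs'), hints.get('return')


def bar(docs: List[InputDoc], **kwargs) -> List[OutputDoc]:
    return [OutputDoc(score=0.0) for _ in docs]


# The annotations alone are enough to recover both schemas.
request_schema, response_schema = infer_schemas(bar)
```

Here request_schema is List[InputDoc] and response_schema is List[OutputDoc], recovered without calling bar.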
You can also explicitly define the schema of the endpoint by using the request_schema and response_schema parameters of the requests decorator:
class MyExec(Executor):
    @requests(
        on='/bar',
        request_schema=DocList[InputDoc],
        response_schema=DocList[OutputDoc],
    )
    def bar(self, docs, **kwargs):
        docs_return = DocList[OutputDoc](
            [OutputDoc(embedding=np.zeros((100, 1))) for _ in range(len(docs))]
        )
        return docs_return
If request_schema and response_schema are not set, the type hints are used to infer the schema. If both the parameters and the type hints are present, request_schema and response_schema take precedence.
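The precedence rule can be pictured with a small standard-library sketch. The resolve_schemas helper and the placeholder classes are hypothetical and only mirror the rule stated above; they are not part of Jina's API.

```python
from typing import Any, Callable, Optional, Tuple, get_type_hints


class HintedDoc:  # hypothetical schema named in the type hints
    pass


class ExplicitDoc:  # hypothetical schema passed explicitly
    pass


def resolve_schemas(
    fn: Callable,
    request_schema: Optional[Any] = None,
    response_schema: Optional[Any] = None,
) -> Tuple[Any, Any]:
    """Explicit schemas win; otherwise fall back to the type hints."""
    hints = get_type_hints(fn)
    if request_schema is None:
        request_schema = hints.get('docs')
    if response_schema is None:
        response_schema = hints.get('return')
    return request_schema, response_schema


def bar(docs: HintedDoc, **kwargs) -> HintedDoc:
    return docs


# No explicit schemas: the type hints are used.
from_hints = resolve_schemas(bar)
# Explicit schemas are given: they override the type hints.
from_explicit = resolve_schemas(
    bar, request_schema=ExplicitDoc, response_schema=ExplicitDoc
)
```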
Serve one Executor in a Deployment#
Once you have defined the Executor with the New Executor API, you can easily serve and scale it as a Deployment with gRPC, HTTP or any combination of these protocols.
from jina import Deployment

with Deployment(uses=MyExec, protocol='grpc', replicas=2) as dep:
    dep.block()
Chain Executors in Flow with different schemas#
With the new API, when building a Flow you should ensure that the Document types an Executor takes as input match the output schema of the Executor that precedes it in the Flow.
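For instance, the first Flow below starts correctly because the output schema of its first Executor matches the input schema of the second, whereas the second Flow fails to start because the Document types are wrongly chained.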
from jina import Executor, requests, Flow
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np


class SimpleStrDoc(BaseDoc):
    text: str


class TextWithEmbedding(SimpleStrDoc):
    embedding: NdArray


class TextEmbeddingExecutor(Executor):
    @requests(on='/foo')
    def foo(self, docs: DocList[SimpleStrDoc], **kwargs) -> DocList[TextWithEmbedding]:
        ret = DocList[TextWithEmbedding]()
        for doc in docs:
            ret.append(TextWithEmbedding(text=doc.text, embedding=np.random.rand(10)))
        return ret


class ProcessEmbedding(Executor):
    @requests(on='/foo')
    def foo(self, docs: DocList[TextWithEmbedding], **kwargs) -> DocList[TextWithEmbedding]:
        for doc in docs:
            self.logger.info(f'Getting embedding with shape {doc.embedding.shape}')
        return docs


# This Flow starts because the input type of "process" matches the output type of "embed"
flow = Flow().add(uses=TextEmbeddingExecutor, name='embed').add(uses=ProcessEmbedding, name='process')
with flow:
    flow.block()
from jina import Executor, requests, Flow
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
import numpy as np


class SimpleStrDoc(BaseDoc):
    text: str


class TextWithEmbedding(SimpleStrDoc):
    embedding: NdArray


class TextEmbeddingExecutor(Executor):
    @requests(on='/foo')
    def foo(self, docs: DocList[SimpleStrDoc], **kwargs) -> DocList[TextWithEmbedding]:
        ret = DocList[TextWithEmbedding]()
        for doc in docs:
            ret.append(TextWithEmbedding(text=doc.text, embedding=np.random.rand(10)))
        return ret


class ProcessText(Executor):
    @requests(on='/foo')
    def foo(self, docs: DocList[SimpleStrDoc], **kwargs) -> DocList[TextWithEmbedding]:
        for doc in docs:
            self.logger.info(f'Getting text {doc.text}')


# This Flow will fail to start because the input type of "process" does not match the output type of "embed"
flow = Flow().add(uses=TextEmbeddingExecutor, name='embed').add(uses=ProcessText, name='process')
with flow:
    flow.block()
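The start-up check described in this section can be pictured as a walk over consecutive Executors that compares each output schema with the next input schema. The sketch below uses only the standard library. Step and find_mismatches are hypothetical names, the two classes are plain-Python stand-ins for the schemas above, and the strict identity comparison is an assumption that happens to be consistent with the failing example (where the input type is a parent class of the output type), not a description of how Jina compares schemas.

```python
from typing import List, NamedTuple, Type


class SimpleStrDoc:  # stand-in for the SimpleStrDoc schema
    pass


class TextWithEmbedding(SimpleStrDoc):  # stand-in for the TextWithEmbedding schema
    pass


class Step(NamedTuple):
    name: str
    input_schema: Type
    output_schema: Type


def find_mismatches(steps: List[Step]) -> List[str]:
    """List every pair of consecutive steps whose schemas do not line up."""
    errors = []
    for prev, nxt in zip(steps, steps[1:]):
        # Assumption: schemas must be the very same class, subclasses do not count.
        if prev.output_schema is not nxt.input_schema:
            errors.append(
                f'{prev.name} -> {nxt.name}: '
                f'{prev.output_schema.__name__} != {nxt.input_schema.__name__}'
            )
    return errors


valid = [
    Step('embed', SimpleStrDoc, TextWithEmbedding),
    Step('process', TextWithEmbedding, TextWithEmbedding),
]
invalid = [
    Step('embed', SimpleStrDoc, TextWithEmbedding),
    Step('process', SimpleStrDoc, TextWithEmbedding),
]

valid_errors = find_mismatches(valid)
invalid_errors = find_mismatches(invalid)
```

With these inputs, valid_errors is empty and invalid_errors reports the embed -> process pair.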
Client API#
Similarly, in the client, you specify the schema that you expect the Deployment or Flow to return. You can pass the return type by using the return_type parameter of the client.post method:
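from jina import Deployment
from docarray import DocList
from docarray.documents import ImageDoc
import numpy as np

# MyExec, InputDoc and OutputDoc are the classes defined in the Executor example above
with Deployment(uses=MyExec) as dep:
    docs = dep.post(
        on='/bar',
        inputs=InputDoc(img=ImageDoc(tensor=np.zeros((3, 224, 224)))),
        return_type=DocList[OutputDoc],
    )
    assert docs[0].embedding.shape == (100, 1)
    assert docs.__class__.doc_type == OutputDoc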
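One way to see why the client needs return_type: assuming the response reaches the client as untyped data, something has to say which class to rebuild it into. The sketch below is a standard-library analogy only. parse_response and the dataclass OutputDoc are hypothetical, and Jina's actual deserialization is not shown here.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Type, TypeVar

T = TypeVar('T')


@dataclass
class OutputDoc:  # hypothetical stand-in for the OutputDoc schema
    embedding: List[float]


def parse_response(payload: List[Dict[str, Any]], return_type: Type[T]) -> List[T]:
    """Rebuild typed objects from an untyped payload using the requested type."""
    return [return_type(**item) for item in payload]


# Untyped data, as it might look after travelling over the network.
raw = [{'embedding': [0.0, 0.0]}, {'embedding': [1.0, 2.0]}]
docs = parse_response(raw, return_type=OutputDoc)
```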
Compatible features#
Jina is working to offer full compatibility with the new DocArray version.
However, there are currently some limitations to consider.
Note
With DocArray 0.30 support, Jina introduced the concept of input/output schema at the Executor level. To chain multiple Executors into a Flow, you need to ensure that the output schema of an Executor is the same as the input schema of the Executor that follows it in the Flow.
````{admonition} Note
:class: note
For now, [Executor Hub](https://cloud.jina.ai/executors) will not automatically build your Docker images with the new DocArray version. If you need this, provide your own
Dockerfile in which `docarray>=0.30` is explicitly installed.