Migration to Jina 3#
Jina 3 comes with many improvements, but to benefit from them you will need to make some changes to your existing Jina 2 code.
One of the major changes in Jina 3 is that DocArray is now an external dependency: the previously included Document and DocumentArray data structures now form their own library and include new features, improved performance, and increased flexibility.
Accordingly, most of the breaking changes that users will experience when updating to Jina 3 relate to Document and DocumentArray.
DocArray library
DocArray is our new library that includes the Document and DocumentArray data structures. Inside their own library, Document and DocumentArray are faster and more versatile than ever, and underpin neural search apps as well as the Jina ecosystem, including Jina and Finetuner.
In general, the breaking changes aim for increased simplicity and consistency, making your life easier in the long run. The sections below describe exactly what you will have to adapt.
Simple changes at a glance#
Many of the changes introduced in Jina 3 are easily adapted to a Jina 2 codebase. The modifications in the following table should, in most cases, be safe to perform without further thought or effort.
Jina 2 | Jina 3 |
---|---|
doc.blob | doc.tensor |
doc.buffer | doc.blob |
docs.get_attributes('attribute') | docs[:, 'attribute'] |
['path1', 'path2'] | 'path1,path2' |
docs.traverse_flat(paths) | docs['@paths'] |
docs.flatten() | docs[...] |
doc.SerializeToString() | doc.to_bytes() |
Document(bytes) | Document.from_bytes() |
from jina import Document, DocumentArray | from docarray import Document, DocumentArray |
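To locate the places in a codebase where the table above applies, a small scanner can help. The following is only a minimal sketch using the standard library: JINA2_PATTERNS and find_jina2_usages are hypothetical names, the pattern list is illustrative rather than exhaustive, and plain regular expressions can report false positives. The pattern doc.blob is deliberately left out because it is valid in both versions with different meanings.

```python
import re

# Hypothetical pattern table derived from the "Jina 2 | Jina 3" rows above.
# `.blob` is intentionally not listed: it exists in both versions.
JINA2_PATTERNS = [
    (re.compile(r"\.get_attributes\("), "use docs[:, 'attribute']"),
    (re.compile(r"\.traverse_flat\("), "use docs['@paths']"),
    (re.compile(r"\.flatten\(\)"), "use docs[...]"),
    (re.compile(r"\.SerializeToString\(\)"), "use .to_bytes() or .to_json()"),
    (re.compile(r"\.buffer\b"), "use .blob"),
    (
        re.compile(r"from\s+jina\s+import\s+.*\bDocument(Array)?\b"),
        "import from docarray instead",
    ),
]


def find_jina2_usages(source: str):
    """Return (line_number, hint) pairs for lines that look like Jina 2 code."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, hint in JINA2_PATTERNS:
            if pattern.search(line):
                findings.append((lineno, hint))
    return findings


sample = """from jina import Document, DocumentArray
texts = docs.get_attributes('text')
flat = docs.flatten()
"""
for lineno, hint in find_jina2_usages(sample):
    print(f"line {lineno}: {hint}")
```

Each finding is a hint for manual review, not an automatic rewrite.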
There are, however, some more nuanced changes in Jina 3 as well. These are outlined below.
Document: More natural attribute names and Pythonic serialization#
DocArray introduces more natural naming conventions for Document and DocumentArray attributes.
- doc.blob is renamed to doc.tensor, to align with external libraries like PyTorch and TensorFlow
- doc.buffer is renamed to doc.blob, to align with the industry standard
- doc.SerializeToString() is removed in favour of doc.to_bytes() and doc.to_json()
- Creating a Document from serialized data using Document(bytes) is removed in favour of Document.from_bytes(bytes) and Document.from_json(bytes)
DocumentArray: Simplified attribute, element access and new storage options#
DocArray library
Take a look at the DocArray documentation to get a better understanding of accessing attributes and elements with DocArray.
Attributes: DocArray introduces a flexible way of bulk-accessing attributes of Documents in a DocumentArray.
Instead of having to call docs.get_attributes('attribute'), you can simply call docs.attributes (the plural of the attribute name) for a select number of attributes. Currently, this syntax is supported by:
- text: docs.texts
- blob: docs.blobs
- tensor: docs.tensors
- content: docs.contents
- embedding: docs.embeddings
The remaining attributes can be accessed in bulk by calling docs[:, 'attribute'], e.g. docs[:, 'tags']. Additionally, you can access a specific key in tags by calling docs[:, 'tags__key'].
Array traversal: For traversing DocumentArrays via a traversal_path, DocArray introduces a simplified notation:
- Traversal paths of the form [path1, path2] (e.g. ['r', 'cm']) are replaced by a single string of the form 'path1,path2' (e.g. 'r,cm')
- docs.traverse_flat(path) is replaced by docs['@path'] (e.g. docs['@r,cm'])
- docs.flatten() is replaced by docs[...]
# Jina 2
from jina import Document, DocumentArray

# nested_docs() is a placeholder for any function returning a nested DocumentArray
docs = nested_docs()
print(docs.traverse_flat('r,c').texts)
# >>> ['root1', 'root2', 'chunk11', 'chunk12', 'chunk21', 'chunk22']
print(docs.flatten().texts)
# >>> ['chunk11', 'chunk12', 'root1', 'chunk21', 'chunk22', 'root2']

# Jina 3
from docarray import Document, DocumentArray

# nested_docs() is a placeholder for any function returning a nested DocumentArray
docs = nested_docs()
print(docs['@r,c'].texts)
# >>> ['root1', 'root2', 'chunk11', 'chunk12', 'chunk21', 'chunk22']
print(docs[...].texts)
# >>> ['chunk11', 'chunk12', 'root1', 'chunk21', 'chunk22', 'root2']
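When traversal paths are stored as lists in existing Jina 2 code or configuration, the conversion to the new selector string can be done mechanically. The sketch below uses only the standard library; to_selector is a hypothetical helper name and is not part of Jina or DocArray.

```python
from typing import Iterable, Union


def to_selector(paths: Union[str, Iterable[str]]) -> str:
    """Convert Jina 2 traversal paths into an '@' selector string.

    Accepts either a list such as ['r', 'cm'] or an already comma-joined
    string such as 'r,cm'. Hypothetical helper, for illustration only.
    """
    if isinstance(paths, str):
        joined = paths
    else:
        joined = ",".join(paths)
    # Drop stray whitespace around each path, e.g. 'r, cm' -> 'r,cm'.
    joined = ",".join(part.strip() for part in joined.split(","))
    return "@" + joined


print(to_selector(["r", "cm"]))  # @r,cm
print(to_selector("r, c"))       # @r,c
```

The result can then be used as an index, for example docs[to_selector(paths)].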
Loading data from files: DocumentArray introduces a .from_files() class method, which can be used directly instead of importing a from_files() function.
# Jina 2
from jina import Document, DocumentArray
from jina.types.document.generators import from_files

docs = DocumentArray(from_files('path/to/files'))

# Jina 3
from docarray import Document, DocumentArray

docs = DocumentArray.from_files('path/to/files')
Batching: Batching operations are delegated to the docarray package and Python builtins:
docs.batch() no longer accepts the arguments traversal_paths= and require_attr=. The example below shows how to achieve complex behavior that previously relied on these arguments, in a more Pythonic and Jina 3 compatible way:
# Jina 2
docs.batch(traversal_paths=paths, batch_size=bs, require_attr='attr')

# Jina 3
DocumentArray(filter(lambda x: bool(x.attr), docs['@paths'])).batch(batch_size=bs)
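To make the filter-then-batch idea concrete without depending on either library, here is a minimal sketch on plain Python objects. filter_and_batch is a hypothetical helper and SimpleNamespace objects stand in for Documents; it illustrates the behavior of the one-liner above and is not how DocumentArray.batch() is implemented.

```python
from itertools import islice
from types import SimpleNamespace


def filter_and_batch(items, attr, batch_size):
    """Yield lists of at most `batch_size` items whose `attr` is truthy."""
    kept = (item for item in items if bool(getattr(item, attr, None)))
    while True:
        batch = list(islice(kept, batch_size))
        if not batch:
            return
        yield batch


# Stand-in objects instead of real Documents; only `.text` matters here.
items = [SimpleNamespace(text=t) for t in ["a", "", "b", "c", "", "d", "e"]]
batches = [[d.text for d in batch] for batch in filter_and_batch(items, "text", 2)]
print(batches)  # [['a', 'b'], ['c', 'd'], ['e']]
```

Items with an empty attribute are dropped before batching, so batch sizes stay full until the final, possibly shorter, batch.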
Accessing non-existent values: In Jina 2, bulk-accessing attributes in a DocumentArray returns a list of empty values when the Documents inside the DocumentArray do not have a value for that attribute. In Jina 3, this returns None. This change becomes important when migrating code that checks for the presence of a certain attribute.
# Jina 2
from jina import Document, DocumentArray

d = Document()
print(d.text)
# >>> ''
docs = DocumentArray([d, d])
print(docs.texts)
# >>> ['', '']

# Jina 3
from docarray import Document, DocumentArray

d = Document()
print(d.text)
# >>> ''
docs = DocumentArray([d, d])
print(docs.texts)
# >>> None
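Code that checks for the presence of an attribute can be written so that it works under both conventions. The sketch below is one possible approach, not an official API: has_any_value is a hypothetical helper, and it assumes the individual values are sized objects such as strings, bytes, or lists.

```python
def has_any_value(bulk) -> bool:
    """Return True if a bulk attribute access yielded a non-empty value.

    Handles both conventions described above: a list of empty values
    (Jina 2) and a plain None (Jina 3). Assumes each value supports len().
    """
    if bulk is None:
        return False
    return any(value is not None and len(value) > 0 for value in bulk)


print(has_any_value(None))         # False
print(has_any_value(["", ""]))     # False
print(has_any_value(["", "text"])) # True
```

A call such as has_any_value(docs.texts) then replaces checks that relied on the returned list being empty or containing empty strings.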
Serialization: DocumentArray introduces the same Pythonic serialization syntax as Document.
- docs.SerializeToString() is removed in favour of docs.to_bytes() and docs.to_json()
- Creating a DocumentArray from serialized data using DocumentArray(bytes) is removed in favour of DocumentArray.from_bytes(bytes) and DocumentArray.from_json(bytes)
New storage options:
Jina 2 used to offer persistence of DocumentArray through DocumentArrayMemmap. In Jina 3, this data structure is deprecated and we introduce different Document Stores within the DocumentArray API. Thus, you can enjoy a consistent DocumentArray API across different storage backends and leverage modern databases.
For example, you can use the SQLite backend as a replacement for DocumentArrayMemmap, which lets you persist Documents to disk and load them in another session:
# Session 1: create a SQLite-backed DocumentArray and add Documents
from docarray import Document, DocumentArray

das = DocumentArray(
    storage='sqlite',
    config={'connection': 'my_connection', 'table_name': 'my_table_name'},
)
das.extend([Document() for _ in range(10)])

# Session 2: reconnect with the same config to load the persisted Documents
from docarray import DocumentArray

das = DocumentArray(
    storage='sqlite',
    config={'connection': 'my_connection', 'table_name': 'my_table_name'},
)
print(len(das))
# >>> 10
The API is almost the same as the deprecated DocumentArrayMemmap and is consistent across storage backends and in-memory storage. Furthermore, some Document Stores offer fast nearest neighbor algorithms and are more convenient in production.
Flow and Client: Simplified .post() behavior#
client.post() and flow.post() now return a flattened DocumentArray instead of a list of Responses when no callback function is specified.
.post() can still be configured to return a list of Responses by passing return_responses=True to the Client or Flow constructors.
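During a gradual migration, calling code may need to cope with both return shapes. The following is a minimal sketch under stated assumptions: collect_docs is a hypothetical helper, SimpleNamespace objects stand in for Responses and Documents, and it relies on Response-like objects exposing a docs attribute while Documents do not.

```python
from types import SimpleNamespace


def collect_docs(result):
    """Flatten the return value of .post() into a single list of Documents.

    Handles both shapes: a list of Response-like objects that each expose
    `.docs`, and an already-flat collection of Documents.
    """
    flat = []
    for item in result:
        if hasattr(item, "docs"):
            flat.extend(item.docs)  # Response-like: unpack its documents
        else:
            flat.append(item)       # already a Document
    return flat


# Stand-ins for illustration only; real code would pass the .post() result.
doc_a, doc_b, doc_c = (SimpleNamespace(text=t) for t in "abc")
responses = [SimpleNamespace(docs=[doc_a, doc_b]), SimpleNamespace(docs=[doc_c])]

print([d.text for d in collect_docs(responses)])              # ['a', 'b', 'c']
print([d.text for d in collect_docs([doc_a, doc_b, doc_c])])  # ['a', 'b', 'c']
```

Once all callers expect a flat DocumentArray, a helper like this can be removed.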
Consistent YAML parsing syntax#
In Jina 3, YAML syntax is aligned with GitHub Actions notation, which leads to the following changes:
- Referencing environment variables using the syntax ${{ VAR }} is no longer allowed. The POSIX notation for environment variables, $var, has been deprecated. Instead, use ${{ ENV.VAR }}.
- The syntax ${{ VAR }} now defaults to signifying a context variable, passed in a dict(). If you want to be explicit about the use of context variables, you can use ${{ CONTEXT.VAR }}.
- Relative paths can point to other variables within the same .yaml file, and can be referenced using the syntax ${{root.path.to.var}}.
Environment variables vs. relative paths
Note that the only difference between the environment variable syntax and the relative path syntax is the inclusion of spaces in the former (${{ var }}) and the omission of spaces in the latter (${{path}}).
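Existing YAML files can be updated to the ${{ ENV.VAR }} form with a simple text rewrite. The sketch below is a hedged illustration using only the standard library: migrate_env_refs is a hypothetical helper, and it assumes that every ${{ VAR }} and $var in the old file was meant as an environment variable. References that should become context variables need to be handled by hand.

```python
import re

# `${{ VAR }}` with surrounding spaces and a bare name (no dots), so that
# `${{ ENV.X }}`, `${{ CONTEXT.X }}` and `${{root.path}}` are left alone.
_SPACED_VAR = re.compile(r"\$\{\{\s+([A-Za-z_][A-Za-z0-9_]*)\s+\}\}")
# POSIX-style `$var`; a `$` followed by `{` does not match.
_POSIX_VAR = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def migrate_env_refs(yaml_text: str) -> str:
    """Rewrite Jina 2 style environment references to `${{ ENV.VAR }}`."""
    text = _SPACED_VAR.sub(r"${{ ENV.\1 }}", yaml_text)
    return _POSIX_VAR.sub(r"${{ ENV.\1 }}", text)


old = (
    "port: ${{ PORT }}\n"
    "host: $HOST\n"
    "name: ${{root.metadata.name}}\n"
    "key: ${{ ENV.KEY }}\n"
)
print(migrate_env_refs(old))
```

Relative path references and already-migrated ENV or CONTEXT references pass through unchanged.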
Common errors and solutions#
AttributeError: 'Document' object has no attribute 'buffer'
Solution
Replace doc.buffer with doc.blob in your entire codebase.
RuntimeError: Could not infer dtype of NoneType while performing doc.embed()
Solution
Replace doc.blob with doc.tensor in your entire codebase.
AttributeError: 'DocumentArray' object has no attribute 'get_attributes'
Solution
Replace docs.get_attributes('attribute') with docs[:, 'attribute'].
AttributeError: 'Document' object has no attribute 'SerializeToString'
Solution
Replace doc.SerializeToString with doc.to_bytes or doc.to_json.
ValueError: Failed to initialize docarray.document.Document from obj=b"..."
Solution
Replace Document(bytes) with Document.from_bytes(bytes).
TypeError: batch() got an unexpected keyword argument 'traversal_paths'
Solution
Replace docs.batch(traversal_paths='path', batch_size=bs) with docs['@path'].batch(batch_size=bs).
TypeError: batch() got an unexpected keyword argument 'require_attr'
Solution
Replace docs.batch(traversal_paths='path', batch_size=bs, require_attr='attr') with
DocumentArray(filter(lambda x: bool(x.attr), docs['@path'])).batch(batch_size=bs)
AttributeError: 'Document' object has no attribute 'docs' when operating on the output of flow.post()
Solution
Remove the resp[i].docs access, as flow.post() already returns a DocumentArray.