In this section, you will get an overview of how Jina implements CRUD: Create, Read, Update, Delete.
CRUD stands for Create, Read, Update, Delete. Together, they form the basis of any database engine. While Jina itself is not a database engine, it does provide these methods of interacting with the data stored in its indexes.
Before version 1.0, Jina only supported indexing (creating) and querying (reading) Documents. To update or delete a Document, you had to edit your dataset and then rebuild the Flow and indexes. This could cause problems with large datasets.
With the release of version 1.0 we are introducing update and delete operations. These are implemented across our Executors and Drivers, and allow you to update and delete Documents by their ids.
Before you start
Study the basics here
A basic example of this can be found in the
The Flow class now supports delete and update methods, with a signature similar to index:
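    # `f` is an existing Flow; `random_docs` is assumed to return a list of
    # Documents that already have ids.
    docs = random_docs(10)
    with f:
        f.index(input_fn=docs)

    new_docs = random_docs(10)
    with f:
        f.update(input_fn=new_docs)

    # Deletion takes the ids of the Documents to remove.
    doc_ids = [d.id for d in docs]
    with f:
        f.delete(doc_ids)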
Note: updates and deletions are matched by the id of the Document.
There are some limitations to what Jina can do for the moment. These are trade-offs we made to keep Jina performant, robust, and easy to use. Some of them will be addressed in future versions, while others are intrinsic to Jina’s architecture.
For the moment we do not support partial updates. So if you want to update a document, you need to send the entire document. This is due to Jina’s architecture: the document is sent as one piece through the Flow.
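Because the whole Document travels through the Flow, a partial change has to be applied on the client side before the update request is sent. The following is a minimal sketch of that idea, using plain dictionaries in place of Jina Documents. The helper `apply_partial_change` and the field names are hypothetical and not part of Jina's API.

```python
from copy import deepcopy


def apply_partial_change(full_doc: dict, changes: dict) -> dict:
    """Return a complete copy of `full_doc` with `changes` applied.

    Hypothetical helper: plain dicts stand in for Jina Documents.
    The id is kept unchanged, since updates are matched by id.
    """
    if "id" in changes and changes["id"] != full_doc["id"]:
        raise ValueError("the id of a document must not change on update")
    updated = deepcopy(full_doc)
    updated.update(changes)
    return updated


stored = {"id": "doc-1", "text": "old text", "tags": {"lang": "en"}}
# Only `text` changes, but the complete document is what gets sent.
to_send = apply_partial_change(stored, {"text": "new text"})
```

The resulting `to_send` carries every field of the original plus the change, which is the shape an update request needs.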
In Flows with segmenters, a Document may end up being split into chunks. Currently, the update request does not work for such Documents. Instead, you need to manually remove the chunks by their ids, then remove the parent Document by its id. Finally, you can index the new Document with its new contents (and thus new, different child chunks).
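The three steps above can be sketched with a toy in-memory index. This is an illustration of the order of operations only, not Jina's storage or API: `index_document`, `delete`, and the sentence-based chunking are hypothetical stand-ins.

```python
from itertools import count
from typing import Dict, List

# Toy in-memory index: id -> record. Not Jina's storage format.
index: Dict[str, dict] = {}
_chunk_counter = count()  # gives every chunk a fresh, unique id


def index_document(doc_id: str, text: str) -> List[str]:
    """Store a parent document plus one chunk per sentence; return chunk ids."""
    index[doc_id] = {"text": text, "parent_id": None}
    chunk_ids = []
    for sentence in (s.strip() for s in text.split(".")):
        if not sentence:
            continue
        chunk_id = f"chunk-{next(_chunk_counter)}"
        index[chunk_id] = {"text": sentence, "parent_id": doc_id}
        chunk_ids.append(chunk_id)
    return chunk_ids


def delete(ids: List[str]) -> None:
    """Delete records by id."""
    for record_id in ids:
        index.pop(record_id, None)


old_chunks = index_document("doc-1", "First sentence. Second sentence.")

# 1. Remove the chunks by their ids.
delete(old_chunks)
# 2. Remove the parent document by its id.
delete(["doc-1"])
# 3. Index the new document, which produces new child chunks.
new_chunks = index_document("doc-1", "Only one sentence now.")
```

Keeping track of the chunk ids at indexing time, as `old_chunks` does here, is what makes the manual cleanup possible later.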
When you shard data in indexers, the data is split across the shards. This is achieved through the polling: any configuration, which sends each index request to only one shard. During a query, you need to set polling: all so that every shard receives the request. However, this means some shards will get a query request with a key that doesn’t exist in them. In this case, the indexer emits warnings. You can ignore these in this context. The warnings exist for situations where missing keys are not expected.
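The effect of the two polling modes can be modelled with a few lines of plain Python. This is a simplified sketch, not Jina's scheduler: `ToyShard`, `index_any`, and `query_all` are hypothetical names, and the random shard choice only approximates how a single shard is picked.

```python
import logging
import random
from typing import Dict, List, Optional

logger = logging.getLogger("toy_indexer")


class ToyShard:
    """One shard of a key-value index (a stand-in, not a Jina indexer)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}

    def query(self, key: str) -> Optional[str]:
        if key not in self.data:
            # Expected when the key lives on another shard.
            logger.warning("shard %s: key %r not found", self.name, key)
            return None
        return self.data[key]


shards = [ToyShard(f"shard-{i}") for i in range(3)]
rng = random.Random(0)  # seeded so the sketch is reproducible


def index_any(key: str, value: str) -> None:
    """Model of `polling: any`: exactly one shard receives the request."""
    rng.choice(shards).data[key] = value


def query_all(key: str) -> List[str]:
    """Model of `polling: all`: every shard is asked; only hits are kept."""
    results = [shard.query(key) for shard in shards]
    return [r for r in results if r is not None]


index_any("doc-1", "hello")
hits = query_all("doc-1")  # two shards warn, one returns the value
```

The two warnings logged here are the harmless kind described above: the key exists, just not on the shards that complain.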
Indexing while querying
The index, update, and delete operations cannot be executed within the same context as the query operation. This is due to the way flushing to disk works within the Flow and Executor context lifecycle. This is applicable across all VectorIndexers. Thus, you need to exit the Flow context when you want to switch from one set of operations to the other.
You can see this in the code listing at the beginning of this section, where each operation runs in its own Flow context.
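Why the context has to be closed first can be illustrated with a simplified model in which writes are buffered and only flushed when the context exits. `ToyFlow` is a hypothetical stand-in written for this sketch. It assumes a flush-on-exit behaviour and does not reproduce Jina's actual Flow or Executor lifecycle.

```python
from typing import Dict, Optional


class ToyFlow:
    """Simplified model: writes are buffered and flushed on context exit."""

    def __init__(self) -> None:
        self._persisted: Dict[str, str] = {}  # stands in for data on disk
        self._buffer: Dict[str, str] = {}     # writes not yet flushed

    def __enter__(self) -> "ToyFlow":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Flush pending writes when the context closes.
        self._persisted.update(self._buffer)
        self._buffer.clear()

    def index(self, doc_id: str, text: str) -> None:
        self._buffer[doc_id] = text

    def search(self, doc_id: str) -> Optional[str]:
        # Queries only see flushed data.
        return self._persisted.get(doc_id)


f = ToyFlow()

with f:
    f.index("doc-1", "hello")
    same_context = f.search("doc-1")  # not flushed yet, so nothing is found

with f:
    new_context = f.search("doc-1")   # visible once the first context exited
```

In this model the query inside the indexing context misses the new Document, while the query in a fresh context finds it.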
The update and delete operations use masking internally. This is done to maintain high overall performance. However, it means that old data is not actually removed, only marked as deleted. As a result, the size of the indexer on disk (and in memory) grows over time as you perform update or delete operations. We recommend rebuilding the indexers regularly.
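The growth caused by masking, and the effect of a rebuild, can be seen in a toy append-only index. This is a sketch of the general masking technique under simple assumptions. `MaskedIndex` and its methods are hypothetical and do not mirror Jina's indexer classes.

```python
from typing import Dict, List, Optional


class MaskedIndex:
    """Append-only toy index where deletes only mask rows."""

    def __init__(self) -> None:
        self.rows: List[str] = []        # stored payloads, never shrunk
        self.valid: List[bool] = []      # mask: False means "deleted"
        self.key_to_row: Dict[str, int] = {}

    def add(self, key: str, value: str) -> None:
        self.key_to_row[key] = len(self.rows)
        self.rows.append(value)
        self.valid.append(True)

    def delete(self, key: str) -> None:
        row = self.key_to_row.pop(key)
        self.valid[row] = False          # mask the row, keep the payload

    def update(self, key: str, value: str) -> None:
        self.delete(key)                 # mask the old row
        self.add(key, value)             # append the new one

    def query(self, key: str) -> Optional[str]:
        row = self.key_to_row.get(key)
        return None if row is None else self.rows[row]

    def rebuild(self) -> "MaskedIndex":
        """Return a fresh index that contains only the live rows."""
        fresh = MaskedIndex()
        for key, row in self.key_to_row.items():
            fresh.add(key, self.rows[row])
        return fresh


idx = MaskedIndex()
idx.add("doc-1", "v1")
idx.add("doc-2", "v1")
idx.update("doc-1", "v2")
idx.delete("doc-2")

size_before = len(idx.rows)    # still counts the masked rows
rebuilt = idx.rebuild()
size_after = len(rebuilt.rows)  # only live rows remain
```

Updates and deletes stay cheap because nothing is moved or compacted, and the rebuild is the step that reclaims the space.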