How to Use Incremental Indexing


When you have a large amount of data to index, the indexing process may take a while, and you may want the partially indexed documents to be searchable in the meantime. In this case, you can use incremental indexing.

Feature Description

NumpyIndexer is the built-in vector indexer shipped with Jina. When indexing Documents, NumpyIndexer stores them in the file system. After indexing part of the dataset, one can start querying the index with other, more advanced vector indexers, e.g. FaissIndexer or AnnoyIndexer. These advanced indexers can be built directly from the NumpyIndexer index.
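To make the idea of building one indexer from another's stored data more concrete, here is a minimal sketch that uses only NumPy. It is not Jina's implementation: the class `CosineIndex` and the file name `vectors.npy` are hypothetical, and the toy class merely stands in for an advanced indexer such as FaissIndexer or AnnoyIndexer.

```python
import os
import tempfile

import numpy as np


class CosineIndex:
    """Toy stand-in for an advanced indexer (hypothetical, not a Jina class)."""

    def __init__(self, vectors):
        vectors = np.asarray(vectors, dtype=np.float64)
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        # Store unit-length rows so a dot product equals cosine similarity.
        self.unit = vectors / np.maximum(norms, 1e-12)

    def query(self, vector, top_k=1):
        q = np.asarray(vector, dtype=np.float64)
        q = q / max(np.linalg.norm(q), 1e-12)
        scores = self.unit @ q
        # Row ids of the most similar stored vectors, best first.
        return np.argsort(-scores)[:top_k]


# Stage 1: a simple store writes the raw vectors to disk.
rng = np.random.default_rng(0)
stored = rng.random((5, 10))
path = os.path.join(tempfile.mkdtemp(), 'vectors.npy')
np.save(path, stored)

# Stage 2: a different index is built directly from the stored file.
index = CosineIndex(np.load(path))
best = index.query(stored[3])
```

Querying with one of the stored vectors should return its own row id first, since a vector has the highest cosine similarity with itself. The point of the split is that the expensive search structure can be rebuilt from whatever has been written to disk so far.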


We start by indexing five Documents with NumpyIndexer. With the following code, the documents are stored in numpy_ws/numpy_vec.gz.
from jina import Flow, Document
import numpy as np

docs = [Document(text=f'doc{idx}', embedding=np.random.rand(10)) for idx in range(5)]

index_f = Flow().add(uses='numpy_idx.yml')

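with index_f:
    # Index the five documents, then save the Flow config for later reuse.
    index_f.index(docs)
    index_f.save_config('flow.yml')

The indexer config numpy_idx.yml is assumed to follow the usual Jina executor YAML layout:
!NumpyIndexer
with:
  index_filename: numpy_vec.gz
metas:
  name: numpy_idx
  workspace: numpy_ws

Running the code prints a log line similar to:
[I]:indexer size: 5 physical size: 0 Bytes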

In the step above, we saved the Flow config to flow.yml. Afterwards, we can build a Flow from the same config to load the indexed documents and query them.
query_f = Flow.load_config('flow.yml')

with query_f:
    # Search with a single query Document that carries a random embedding.
    query_f.search([Document(text='query', embedding=np.random.rand(10))])

Now you might want to incrementally index another five documents.
docs = [Document(text=f'doc{idx+5}', embedding=np.random.rand(10)) for idx in range(5)]

index_f = Flow.load_config('flow.yml')

with index_f:
    # Append the five new documents to the existing index.
    index_f.index(docs)

The indexer now reports ten documents:
[I]:indexer size: 10 physical size: 3.1 KB
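The growth from 5 to 10 entries can be illustrated with an append-only file. The sketch below is a simplified model rather than Jina's code: the helpers `append_vectors` and `load_vectors` and the file name `vec.gz` are hypothetical. It relies on the fact that Python's gzip module can append a new member to an existing file and reads all members back as one stream.

```python
import gzip
import os
import tempfile

import numpy as np

DIM = 10  # embedding size, matching the examples in this guide


def append_vectors(path, vectors):
    """Append float32 vectors to a gzip file (hypothetical helper)."""
    data = np.asarray(vectors, dtype=np.float32).reshape(-1, DIM)
    # 'ab' adds a new gzip member; previously written data is left untouched.
    with gzip.open(path, 'ab') as f:
        f.write(data.tobytes())


def load_vectors(path):
    """Read every stored vector back as an (n, DIM) array."""
    with gzip.open(path, 'rb') as f:
        raw = f.read()  # concatenated gzip members are read as one stream
    return np.frombuffer(raw, dtype=np.float32).reshape(-1, DIM)


path = os.path.join(tempfile.mkdtemp(), 'vec.gz')

# First indexing run: five vectors.
first = np.random.rand(5, DIM).astype(np.float32)
append_vectors(path, first)
size_after_first = len(load_vectors(path))   # 5

# Incremental run: five more vectors are appended to the same file.
second = np.random.rand(5, DIM).astype(np.float32)
append_vectors(path, second)
size_after_second = len(load_vectors(path))  # 10
```

Because the second run only appends, the first five vectors stay exactly as they were written, which is what allows the earlier part of the index to remain usable between indexing runs.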


Query-while-indexing is not supported yet, so one cannot index and query with the same Flow at the same time.