Incremental Indexing

Summary
When you have a huge amount of data to index and the indexing process takes a while, you may want the partially indexed Documents to be searchable before indexing finishes. In this case, you can use incremental indexing.
Feature Description
NumpyIndexer is the built-in vector indexer shipped with Jina. When indexing Documents, NumpyIndexer stores them in the file system. After indexing part of the dataset, you can start querying the index with other, more advanced vector indexers, e.g. FaissIndexer and AnnoyIndexer. These advanced indexers can be built directly from the index created by NumpyIndexer.
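For example, the config below sketches how such a query-time indexer might look. It assumes the hub AnnoyIndexer accepts a ref_indexer field pointing at an existing NumpyIndexer (check the indexer's own documentation for the exact parameters); the wrapped indexer then reads the vectors already stored under numpy_ws instead of rebuilding them:

```yaml
# Sketch only: assumes an AnnoyIndexer that can wrap an existing NumpyIndexer
# via `ref_indexer`, reusing the vectors already written to numpy_ws/numpy_vec.gz.
!AnnoyIndexer
with:
  ref_indexer:
    !NumpyIndexer
    with:
      index_filename: numpy_vec.gz
    metas:
      name: numpy_idx
      workspace: numpy_ws
```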
Implementation
We start by indexing five Documents with NumpyIndexer. With the following code, the Documents are stored in numpy_ws/numpy_vec.gz.
index_five_docs.py:

```python
from jina import Flow, Document
import numpy as np

docs = [Document(text=f'doc{idx}', embedding=np.random.rand(10)) for idx in range(5)]

index_f = Flow().add(uses='numpy_idx.yml')
with index_f:
    index_f.index(docs)
    index_f.save_config('flow.yml')
```
numpy_idx.yml:

```yaml
!NumpyIndexer
with:
  index_filename: numpy_vec.gz
metas:
  name: numpy_idx
  workspace: numpy_ws
```
```text
...
indexer size: 5 physical size: 0 Bytes
...
```
In the step above, we saved the Flow config to flow.yml. Afterwards, we can build a Flow from the same config to load the indexed Documents and run queries.
query_docs.py:

```python
from jina import Flow, Document
import numpy as np

query_f = Flow.load_config('flow.yml')
with query_f:
    query_f.search([Document(text='doc0', embedding=np.random.rand(10))])
```
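The call above sends the query but does not show the results. As a rough sketch, assuming this Jina version accepts the on_done response callback on Flow.search, you can print the raw response like this:

```python
from jina import Flow, Document
import numpy as np

query_f = Flow.load_config('flow.yml')
with query_f:
    # `on_done` is assumed to be the response callback parameter of
    # Flow.search in this Jina version; here it simply prints the raw response.
    query_f.search([Document(text='doc0', embedding=np.random.rand(10))], on_done=print)
```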
Now you might want to incrementally index another five documents.
incremental_indexing_docs.py:

```python
from jina import Flow, Document
import numpy as np

docs = [Document(text=f'doc{idx + 5}', embedding=np.random.rand(10)) for idx in range(5)]

index_f = Flow.load_config('flow.yml')
with index_f:
    index_f.index(docs)
```
```text
...
indexer size: 10 physical size: 3.1 KB
...
```
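Because the new Documents are appended to the same workspace, a query Flow built from flow.yml now searches over all ten Documents. A minimal re-check, under the same assumptions as the query sketch above:

```python
from jina import Flow, Document
import numpy as np

# The same Flow config now points at an index holding all ten Documents.
query_f = Flow.load_config('flow.yml')
with query_f:
    # `on_done=print` is an assumption (see the query sketch above); it dumps
    # the raw response so you can confirm matches can come from both batches.
    query_f.search([Document(text='doc7', embedding=np.random.rand(10))], on_done=print)
```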
Limitations
Query-while-indexing is not supported yet, and therefore one can NOT index and query with the same Flow at the same time.
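In practice this means the index Flow and the query Flow have to run strictly one after the other, as in the sketch below (hypothetical follow-up data, same config as above); the first context manager closes the index Flow before the query Flow starts:

```python
from jina import Flow, Document
import numpy as np

# Hypothetical follow-up batch; any new Documents would do.
new_docs = [Document(text=f'doc{idx + 10}', embedding=np.random.rand(10)) for idx in range(5)]

# Index and query run in separate, non-overlapping Flow sessions.
with Flow.load_config('flow.yml') as index_f:
    index_f.index(new_docs)

# The index Flow is closed at this point; only now do we start the query Flow.
with Flow.load_config('flow.yml') as query_f:
    query_f.search([Document(text='doc12', embedding=np.random.rand(10))])
```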