jina.executors.indexers.vector

class jina.executors.indexers.vector.BaseNumpyIndexer(compress_level=1, ref_indexer=None, delete_on_dump=False, *args, **kwargs)[source]

Bases: jina.executors.indexers.BaseVectorIndexer

BaseNumpyIndexer stores and loads vectors in a compressed binary file.

Note

compress_level trades time against space. By default, NumpyIndexer has compress_level = 0.

Setting compress_level > 0 gives a smaller file on disk at index time. However, at query time all data is loaded into memory at once, which is not ideal for large-scale applications.

Setting compress_level = 0 enables np.memmap, which loads data on demand and gives a smaller memory footprint at query time. However, it often produces a larger file on disk.
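The trade-off described in this note can be illustrated with plain numpy and the standard gzip module. This is a minimal sketch and not the indexer's actual code: the file names, the float32 dtype and the fixed dimension of 64 are assumptions made for the example.

```python
import gzip
import os
import tempfile

import numpy as np

# Hypothetical data: 1000 vectors of dimension 64 (both values are assumptions).
vectors = np.random.rand(1000, 64).astype(np.float32)

with tempfile.TemporaryDirectory() as workdir:
    # compress_level > 0: the bytes are gzipped, so the whole file has to be
    # decompressed into memory before any vector can be read.
    gz_path = os.path.join(workdir, "vecs.gz")
    with gzip.open(gz_path, "wb", compresslevel=1) as f:
        f.write(vectors.tobytes())
    with gzip.open(gz_path, "rb") as f:
        from_gzip = np.frombuffer(f.read(), dtype=np.float32).reshape(-1, 64)

    # compress_level = 0: raw bytes on disk, mapped lazily with np.memmap,
    # so only the rows that are actually touched get read.
    raw_path = os.path.join(workdir, "vecs.bin")
    with open(raw_path, "wb") as f:
        f.write(vectors.tobytes())
    mapped = np.memmap(raw_path, dtype=np.float32, mode="r", shape=(1000, 64))
    first_rows = np.array(mapped[:10])  # copy just the first ten rows

    # File sizes in bytes for the gzipped and the raw layout.
    sizes = (os.path.getsize(gz_path), os.path.getsize(raw_path))
    del mapped  # release the mapping before the directory is removed
```

How much smaller the gzipped file is depends heavily on the data; random floats, as used here, compress poorly.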

Parameters
  • compress_level (int) – An integer from 0 to 9 controlling the level of compression; 1 is fastest and produces the least compression, and 9 is slowest and produces the most compression. 0 is no compression at all. The default is 1.

  • ref_indexer (Optional[BaseNumpyIndexer]) – Bootstrap the current indexer from a ref_indexer. This enables the user to switch the query algorithm at query time.

  • delete_on_dump (bool) – whether to delete the rows marked as deleted (see valid_indices)

property workspace_name

Get the workspace name.

property index_abspath

Get the file path of the index storage

Use index_abspath

Return type

str

get_add_handler()[source]

Open a binary gzip file for appending new vectors

Return type

BufferedWriter

Returns

a gzip file stream

get_create_handler()[source]

Create a new gzip file for adding new vectors. The old vectors are replaced.

Return type

BufferedWriter

Returns

a gzip file stream
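The difference between the append handler and the create handler corresponds to opening a gzip file in "ab" versus "wb" mode. The sketch below uses only the standard gzip module; the file name, dtype and the helper read_all are assumptions for illustration, not part of the indexer.

```python
import gzip
import os
import tempfile

import numpy as np

dim = 4  # hypothetical vector dimension
first = np.ones((2, dim), dtype=np.float32)
second = np.zeros((3, dim), dtype=np.float32)


def read_all(path):
    """Decompress the whole file and view it as rows of `dim` floats."""
    with gzip.open(path, "rb") as f:
        return np.frombuffer(f.read(), dtype=np.float32).reshape(-1, dim)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "vecs.gz")

    # "wb" creates the file (or truncates an existing one).
    with gzip.open(path, "wb") as f:
        f.write(first.tobytes())

    # "ab" appends a new gzip member; readers see one continuous stream.
    with gzip.open(path, "ab") as f:
        f.write(second.tobytes())
    appended = read_all(path)  # 2 + 3 rows

    # Opening with "wb" again discards the old vectors.
    with gzip.open(path, "wb") as f:
        f.write(second.tobytes())
    recreated = read_all(path)  # 3 rows only
```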

add(keys, vectors, *args, **kwargs)[source]

Add the embeddings and document ids to the index.

Parameters
  • keys (Iterable[str]) – a list of ids, i.e. doc.id in protobuf

  • vectors (ndarray) – embeddings

  • args – not used

  • kwargs – not used

Return type

None

update(keys, vectors, *args, **kwargs)[source]

Update the embeddings on the index via document ids.

Parameters
  • keys (Iterable[str]) – a list of ids, i.e. doc.id in protobuf

  • vectors (ndarray) – embeddings

  • args – not used

  • kwargs – not used

Return type

None

delete(keys, *args, **kwargs)[source]

Delete the embeddings from the index via document ids.

Parameters
  • keys (Iterable[str]) – a list of ids, i.e. doc.id in protobuf

  • args – not used

  • kwargs – not used

Return type

None
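One common way to support deletion in a flat vector file is to mark rows as invalid and drop them later, which is what the delete_on_dump parameter and valid_indices suggest. The sketch below shows that general idea only; the names valid, mark_deleted, live_keys and live_vectors are hypothetical and do not describe the indexer's internals.

```python
import numpy as np

# Hypothetical index contents: four keys with one vector each.
keys = ["a", "b", "c", "d"]
vectors = np.arange(12, dtype=np.float32).reshape(4, 3)

# One flag per row; False means "marked as deleted".
valid = np.ones(len(keys), dtype=bool)


def mark_deleted(to_delete):
    """Flag the rows of the given keys without rewriting the vector data."""
    for key in to_delete:
        valid[keys.index(key)] = False


mark_deleted(["b"])

# A later compaction step (for example at dump time) keeps only valid rows.
live_keys = [key for key, ok in zip(keys, valid) if ok]
live_vectors = vectors[valid]
```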

get_query_handler()[source]

Open a gzip file and load it as a numpy ndarray

Return type

Optional[ndarray]

Returns

a numpy ndarray of vectors

build_advanced_index(vecs)[source]

Not implemented here.

sample()[source]

Return a random entry from the indexer for sanity check.

Return type

Optional[bytes]

Returns

A random entry from the indexer.

query_by_key(keys, *args, **kwargs)[source]

Search the index by the external key (passed during .add()).

Parameters
  • keys (Iterable[str]) – a list of ids, i.e. doc.id in protobuf

  • args – not used

  • kwargs – not used

Return type

Optional[ndarray]

Returns

ndarray of vectors
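Looking vectors up by external key amounts to translating each key into a row number and then indexing the vector matrix. A minimal sketch, assuming a simple in-memory dictionary; key_to_row and lookup are hypothetical names, and the choice to skip unknown keys is an assumption of this example.

```python
import numpy as np

# Hypothetical index contents.
keys = ["doc1", "doc2", "doc3"]
vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype=np.float32)

# Map each external key to the row that holds its vector.
key_to_row = {key: row for row, key in enumerate(keys)}


def lookup(query_keys):
    """Return the vectors for the known keys, or None if none are found."""
    rows = [key_to_row[key] for key in query_keys if key in key_to_row]
    return vectors[rows] if rows else None


found = lookup(["doc3", "doc1"])  # rows 2 and 0, in request order
missing = lookup(["unknown"])     # no match
```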

class jina.executors.indexers.vector.NumpyIndexer(metric='cosine', backend='numpy', compress_level=0, *args, **kwargs)[source]

Bases: jina.executors.indexers.vector.BaseNumpyIndexer

An exhaustive vector indexer implemented with numpy and scipy.

Note

Metrics other than cosine and euclidean require scipy to be installed.

Parameters
  • metric (str) – The distance metric to use. braycurtis, canberra, chebyshev, cityblock, correlation, cosine, dice, euclidean, hamming, jaccard, jensenshannon, kulsinski, mahalanobis, matching, minkowski, rogerstanimoto, russellrao, seuclidean, sokalmichener, sokalsneath, sqeuclidean, wminkowski, yule.

  • backend (str) – numpy or scipy; numpy only supports euclidean and cosine distance

  • compress_level (int) – compression level to use

batch_size = 512

query(vectors, top_k, *args, **kwargs)[source]

Find the top-k vectors with the smallest metric and return their ids in ascending order.

Returns

a tuple of two ndarrays. The first holds the ids in shape B x K (dtype=int), the second holds the metric in shape B x K (dtype=float)

Warning

This operation is memory-consuming.

Distance (the smaller the better) is returned, not the score.

Parameters
  • vectors (ndarray) – the vectors with which to search

  • args – not used

  • kwargs – not used

  • top_k (int) – number of results to return

Return type

Tuple[ndarray, ndarray]

Returns

tuple of indices within matrix and distances
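An exhaustive top-k search of this kind can be sketched in plain numpy as follows. This is an illustrative brute-force cosine search and not NumpyIndexer's implementation; the function name cosine_top_k and the sample data are assumptions. It returns distances (smaller is better) in two B x K arrays, matching the shapes described above.

```python
import numpy as np


def cosine_top_k(queries, index, top_k):
    """Brute-force cosine search: returns (row indices, distances), both B x K."""
    # Normalise rows so the dot product equals the cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    x = index / np.linalg.norm(index, axis=1, keepdims=True)
    dist = 1.0 - q @ x.T  # B x N matrix of cosine distances

    top_k = min(top_k, dist.shape[1])
    # argpartition finds the k smallest per row without a full sort ...
    part = np.argpartition(dist, top_k - 1, axis=1)[:, :top_k]
    part_dist = np.take_along_axis(dist, part, axis=1)
    # ... and only those k candidates are then sorted in ascending order.
    order = np.argsort(part_dist, axis=1)
    idx = np.take_along_axis(part, order, axis=1)
    return idx, np.take_along_axis(part_dist, order, axis=1)


# Hypothetical index of three 2-d vectors and a single query.
index = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[1.0, 0.1]])
idx, dist = cosine_top_k(queries, index, top_k=2)
```

The full B x N distance matrix is what makes the operation memory-consuming; processing the queries or the index in batches bounds that cost.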

build_advanced_index(vecs)[source]

Build advanced index structure based on in-memory numpy ndarray, e.g. graph, tree, etc.

Parameters

vecs (ndarray) – The raw numpy ndarray.

Return type

ndarray

Returns

Advanced index.

class jina.executors.indexers.vector.VectorIndexer(metric='cosine', backend='numpy', compress_level=0, *args, **kwargs)[source]

Bases: jina.executors.indexers.vector.NumpyIndexer

Alias to NumpyIndexer