jina.executors.indexers.vector

class jina.executors.indexers.vector.BaseNumpyIndexer(compress_level=1, ref_indexer=None, *args, **kwargs)

Bases: jina.executors.indexers.BaseVectorIndexer

BaseNumpyIndexer stores and loads vectors in a compressed binary file.

Note

compress_level trades time against space. By default, :class:`NumpyIndexer` has compress_level = 0.

Setting compress_level>0 gives a smaller file on disk at index time. However, at query time it loads all data into memory at once, which is not ideal for large-scale applications.

Setting compress_level=0 enables :func:`np.memmap`, which loads data on demand and gives a smaller memory footprint at query time. However, it often gives a larger file on disk.
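
The trade-off in this note can be illustrated with a minimal sketch that uses only the standard gzip module and numpy, not the indexer itself. The file names, array shape, and sample data are hypothetical, and how BaseNumpyIndexer lays out its file internally is an assumption here.

```python
import gzip
import os
import tempfile

import numpy as np

# Hypothetical data: 1000 vectors of dimension 64, mostly zeros so they compress well.
vecs = np.zeros((1000, 64), dtype=np.float32)
vecs[::10] = 1.0

tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "vecs.gz")    # analogous to compress_level > 0
raw_path = os.path.join(tmp_dir, "vecs.bin")  # analogous to compress_level = 0

# Index time, compressed: bytes go through gzip.
with gzip.open(gz_path, "wb", compresslevel=1) as f:
    f.write(vecs.tobytes())

# Index time, uncompressed: raw bytes on disk.
with open(raw_path, "wb") as f:
    f.write(vecs.tobytes())

# Query time, compressed: the whole file is decompressed into memory at once.
with gzip.open(gz_path, "rb") as f:
    loaded = np.frombuffer(f.read(), dtype=np.float32).reshape(-1, 64)

# Query time, uncompressed: np.memmap reads pages from disk only when accessed.
mapped = np.memmap(raw_path, dtype=np.float32, mode="r", shape=(1000, 64))

gz_size = os.path.getsize(gz_path)
raw_size = os.path.getsize(raw_path)
print(gz_size, raw_size)
```

On data like this the gzip file is much smaller than the raw file, while only the raw file can be memory-mapped.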

Parameters
  • compress_level (int) – An integer from 0 to 9 controlling the level of compression; 1 is fastest and produces the least compression, and 9 is slowest and produces the most compression. 0 is no compression at all. The default for this class is 1, as shown in the signature above.

  • ref_indexer (Optional[BaseNumpyIndexer]) – Bootstrap the current indexer from a ref_indexer. This lets the user switch the query algorithm at query time.

property index_abspath

Get the file path of the index storage

Use index_abspath

Return type

str

get_add_handler()

Open a binary gzip file for adding new vectors

Returns

a gzip file stream

get_create_handler()

Create a new gzip file for adding new vectors

Returns

a gzip file stream

add(keys, vectors, *args, **kwargs)

Add new chunks and their vector representations

Parameters
  • keys (ndarray) – chunk_id in 1D-ndarray, shape B x 1

  • vectors (ndarray) – vector representations in B x D

Return type

None
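
As a rough illustration of the shapes involved, the sketch below appends batches of vectors to a gzip file and reads them back. It is not the library's implementation: the standalone `add` function, `index_path`, `stored_keys`, and `DIM` are hypothetical names, and storing vectors as float32 is an assumption.

```python
import gzip
import os
import tempfile

import numpy as np

DIM = 4  # hypothetical vector dimension D
index_path = os.path.join(tempfile.mkdtemp(), "vecs.gz")
stored_keys = []  # external keys, in insertion order


def add(keys, vectors):
    """Append B vectors (B x D) and remember their B keys (B x 1)."""
    if keys.shape[0] != vectors.shape[0]:
        raise ValueError("keys and vectors must have the same number of rows")
    # Mode "ab" appends a new gzip member to the existing file.
    with gzip.open(index_path, "ab", compresslevel=1) as f:
        f.write(np.ascontiguousarray(vectors, dtype=np.float32).tobytes())
    stored_keys.extend(keys.ravel().tolist())


add(np.array([[7], [8]]), np.ones((2, DIM)))
add(np.array([[9], [10], [11]]), np.zeros((3, DIM)))

# Reading decompresses all appended members as one continuous stream.
with gzip.open(index_path, "rb") as f:
    all_vecs = np.frombuffer(f.read(), dtype=np.float32).reshape(-1, DIM)

print(all_vecs.shape, stored_keys)
```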

get_query_handler()

Open a gzip file and load it as a numpy ndarray

Return type

Optional[ndarray]

Returns

a numpy ndarray of vectors

build_advanced_index(vecs)

Build advanced index structure based on in-memory numpy ndarray, e.g. graph, tree, etc.

Parameters

vecs (ndarray) – the raw numpy ndarray

Returns

raw_ndarray

query_by_id(ids, *args, **kwargs)

Get the vectors by id, return a subset of indexed vectors

Parameters
  • ids (Union[List[int], ndarray]) – a list of ids, i.e. doc.id in protobuf

  • args

  • kwargs

Return type

ndarray

Returns

int2ext_id

ext2int_id
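
A lookup by id can be sketched as follows, assuming (as the attribute names suggest, but the text above does not confirm) that ext2int_id maps external ids to internal row positions and int2ext_id is the reverse. The arrays and the standalone `query_by_id` function are hypothetical stand-ins, not the indexer's own code.

```python
import numpy as np

# Hypothetical index: three external ids and their vectors (row i belongs to ext_ids[i]).
ext_ids = np.array([101, 205, 309])
vecs = np.arange(12, dtype=np.float32).reshape(3, 4)

# Assumed meaning of the two mappings.
int2ext = ext_ids                                                  # row position -> external id
ext2int = {ext: row for row, ext in enumerate(ext_ids.tolist())}   # external id -> row position


def query_by_id(ids):
    """Return the subset of indexed vectors for the given external ids."""
    rows = [ext2int[int(i)] for i in ids]
    return vecs[rows]


subset = query_by_id([309, 101])
print(subset)
```

The result keeps the order of the requested ids, so the first row here is the vector stored for id 309.
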
class jina.executors.indexers.vector.NumpyIndexer(metric='euclidean', backend='numpy', compress_level=0, *args, **kwargs)

Bases: jina.executors.indexers.vector.BaseNumpyIndexer

An exhaustive vector indexer implemented with numpy and scipy.

Parameters
  • metric (str) – The distance metric to use. braycurtis, canberra, chebyshev, cityblock, correlation, cosine, dice, euclidean, hamming, jaccard, jensenshannon, kulsinski, mahalanobis, matching, minkowski, rogerstanimoto, russellrao, seuclidean, sokalmichener, sokalsneath, sqeuclidean, wminkowski, yule.

  • backend (str) – numpy or scipy; the numpy backend supports only euclidean and cosine distance

Note

Metrics other than cosine and euclidean require scipy to be installed.

batch_size = 512

query(keys, top_k, *args, **kwargs)

Find the top-k vectors with the smallest metric and return their ids in ascending order.

Return type

Tuple[ndarray, ndarray]

Returns

a tuple of two ndarrays: the first holds the ids in shape B x K (dtype=int), the second holds the metric values in shape B x K (dtype=float)

Warning

This operation is memory-consuming.

Distance (the smaller the better) is returned, not the score.
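
To make the return shapes concrete, here is a minimal sketch of an exhaustive euclidean search in plain numpy. It is an illustration under stated assumptions, not the library's code: `brute_force_query`, `index_vecs`, and `index_ids` are hypothetical names, and the real method may compute distances differently.

```python
import numpy as np


def brute_force_query(queries, index_vecs, index_ids, top_k):
    """Return (ids, distances), each of shape B x K, nearest first."""
    # Pairwise differences have shape B x N x D, which is why this is memory-hungry.
    diff = queries[:, None, :] - index_vecs[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))            # B x N euclidean distances
    order = np.argsort(dist, axis=1)[:, :top_k]        # B x K row positions, ascending distance
    return index_ids[order], np.take_along_axis(dist, order, axis=1)


index_vecs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]])
index_ids = np.array([10, 20, 30])
queries = np.array([[0.0, 0.0], [0.0, 2.0]])

ids, dists = brute_force_query(queries, index_vecs, index_ids, top_k=2)
print(ids)    # ids of the two nearest vectors per query
print(dists)  # matching distances, smaller is better
```

For the second query, the vector with id 30 is nearest (distance 1), followed by id 10 (distance 2).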

build_advanced_index(vecs)

Build advanced index structure based on in-memory numpy ndarray, e.g. graph, tree, etc.

Parameters

vecs (ndarray) – the raw numpy ndarray

Returns