jina.types.arrays.mixins.match module

class jina.types.arrays.mixins.match.MatchMixin

Bases: object

A mixin that provides match functionality to DocumentArrays

match(darray, metric='cosine', limit=20, normalization=None, metric_name=None, batch_size=None, traversal_ldarray=None, traversal_rdarray=None, exclude_self=False, filter_fn=None, only_id=False, use_scipy=False, device='cpu', num_worker=None, **kwargs)

Compute embedding-based nearest neighbours in another DocumentArray for each Document in self, and store the results in matches.

Note

'cosine', 'euclidean', 'sqeuclidean' are supported natively without extra dependencies.
You can use other distance metrics provided by ``scipy``, such as `braycurtis`, `canberra`, `chebyshev`,
`cityblock`, `correlation`, `cosine`, `dice`, `euclidean`, `hamming`, `jaccard`, `jensenshannon`,
`kulsinski`, `mahalanobis`, `matching`, `minkowski`, `rogerstanimoto`, `russellrao`, `seuclidean`,
`sokalmichener`, `sokalsneath`, `sqeuclidean`, `wminkowski`, `yule`.
To use a scipy metric, set ``use_scipy=True``.
  • To make all match values fall in [0, 1], use dA.match(dB, normalization=(0, 1)).

  • To invert the distance into a score and make all values fall in [0, 1], use dA.match(dB, normalization=(1, 0)). Note how the normalization tuple differs from the previous case.

  • If a custom distance metric is provided, make sure that it returns distances and not similarities, meaning the smaller the better.
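The effect of the two normalization tuples can be illustrated with a small pure-Python sketch of min-max rescaling. This is not the library's implementation; the helper name `min_max_rescale` and the handling of identical distances are assumptions made for illustration only.

```python
from typing import List, Tuple


def min_max_rescale(distances: List[float], t_range: Tuple[float, float]) -> List[float]:
    """Rescale distances so the smallest maps to a and the largest maps to b.

    Hypothetical helper for illustration; not part of jina.
    """
    a, b = t_range
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # All distances are equal; mapping everything to `a` is an
        # arbitrary choice made for this sketch.
        return [float(a) for _ in distances]
    return [a + (d - lo) * (b - a) / (hi - lo) for d in distances]


distances = [2.0, 4.0, 6.0]

# normalization=(0, 1): values stay distances, the smaller the better.
as_distance = min_max_rescale(distances, (0, 1))

# normalization=(1, 0): the order is inverted, so the closest match
# gets the highest value and can be read as a score.
as_score = min_max_rescale(distances, (1, 0))
```

With (0, 1) the nearest match ends up at 0 and the farthest at 1; with (1, 0) the nearest match ends up at 1 and the farthest at 0.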

Parameters
  • darray (Union[ForwardRef, ForwardRef]) – the other DocumentArray or DocumentArrayMemmap to match against

  • metric (Union[str, Callable[[ForwardRef, ForwardRef], ForwardRef]]) – the distance metric

  • limit (Union[int, float, None]) – the maximum number of matches; defaults to 20 when not given.

  • normalization (Optional[Tuple[float, float]]) – a tuple [a, b] to be used with min-max normalization. The minimum distance is rescaled to a and the maximum distance is rescaled to b, so that all values fall into the range [a, b].

  • metric_name (Optional[str]) – if provided, then match result will be marked with this string.

  • batch_size (Optional[int]) – if provided, then darray is loaded in batches, each containing at most batch_size elements. When darray is big, this can significantly speed up the computation.

  • traversal_ldarray (Optional[Sequence[str]]) – DEPRECATED. If set, then matching is applied along the traversal_path of the left-hand DocumentArray.

  • traversal_rdarray (Optional[Sequence[str]]) – DEPRECATED. If set, then matching is applied along the traversal_path of the right-hand DocumentArray.

  • filter_fn (Optional[Callable[[ForwardRef], bool]]) – DEPRECATED. If set, apply the filter function to filter the docs on the right-hand side (rhv) to be matched.

  • exclude_self (bool) – if set, Documents in darray with the same id as the left-hand values will not be considered as matches.

  • only_id (bool) – if set, then the returned matches will contain only the id.

  • use_scipy (bool) – if set, use scipy as the computation backend. Note that scipy does not support distance computation on sparse matrices.

  • device (str) – the computational device for .match(), can be either cpu or cuda.

  • num_worker (Optional[int]) –

    the number of parallel workers. If not given, then the number of CPUs in the system will be used.

    Note

    This argument is only effective when batch_size is set.

  • kwargs – other kwargs.

Return type

None
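To make the roles of metric, limit and exclude_self concrete, the following is a minimal brute-force sketch in pure Python. It is not how the library computes matches: the names `cosine_distance` and `brute_force_match`, the use of plain dicts mapping ids to embeddings, and the per-pair metric signature are all assumptions made for illustration. Unlike match(), which stores results on the Documents and returns None, this sketch returns the results so they can be inspected.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(x: Vector, y: Vector) -> float:
    """Cosine distance (1 - cosine similarity); smaller means closer.

    Assumes neither vector is all zeros.
    """
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return 1.0 - dot / norm


def brute_force_match(
    left: Dict[str, Vector],
    right: Dict[str, Vector],
    metric: Callable[[Vector, Vector], float] = cosine_distance,
    limit: int = 20,
    exclude_self: bool = False,
) -> Dict[str, List[Tuple[str, float]]]:
    """For each left id, return up to `limit` (right id, distance) pairs,
    sorted from nearest to farthest."""
    results = {}
    for lid, lvec in left.items():
        scored = [
            (rid, metric(lvec, rvec))
            for rid, rvec in right.items()
            # With exclude_self, skip candidates sharing the left-hand id.
            if not (exclude_self and rid == lid)
        ]
        scored.sort(key=lambda pair: pair[1])  # ascending distance
        results[lid] = scored[:limit]
    return results


docs = {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0]}

# Matching a collection against itself: without exclude_self each
# entry's nearest neighbour is itself.
with_self = brute_force_match(docs, docs, limit=1)
without_self = brute_force_match(docs, docs, limit=1, exclude_self=True)
```

Here `with_self['a']` holds the entry 'a' itself, while `without_self['a']` holds 'c', the closest other vector by cosine distance.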