jina.types.arrays.neural_ops module

class jina.types.arrays.neural_ops.DocumentArrayNeuralOpsMixin[source]

Bases: object

A mixin that provides match functionality to DocumentArrays

match(darray, metric='cosine', limit=20, normalization=None, metric_name=None, batch_size=None, traversal_ldarray=None, traversal_rdarray=None, use_scipy=False, exclude_self=False, is_sparse=False)[source]

For each Document in self, compute its embedding-based nearest neighbours in darray and store the results in matches.

Note

'cosine', 'euclidean' and 'sqeuclidean' are supported natively without extra dependencies.
You can use other distance metrics provided by ``scipy``, such as `braycurtis`, `canberra`, `chebyshev`,
`cityblock`, `correlation`, `cosine`, `dice`, `euclidean`, `hamming`, `jaccard`, `jensenshannon`,
`kulsinski`, `mahalanobis`, `matching`, `minkowski`, `rogerstanimoto`, `russellrao`, `seuclidean`,
`sokalmichener`, `sokalsneath`, `sqeuclidean`, `wminkowski`, `yule`.
To use a scipy metric, set ``use_scipy=True``.
  • To make all match values fall in [0, 1], use dA.match(dB, normalization=(0, 1)).

  • To invert the distance into a score and keep all values in the range [0, 1], use dA.match(dB, normalization=(1, 0)). Note how the normalization tuple differs from the previous case.
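The effect of the two normalization tuples can be illustrated with a minimal sketch of min-max rescaling. This is not the library's implementation: the helper name min_max_normalize and the handling of the all-equal case are assumptions made for illustration only.

```python
def min_max_normalize(values, t_range):
    """Rescale values so that min(values) maps to a and max(values) maps to b.

    Hypothetical helper: t_range = (a, b) mirrors the `normalization` tuple.
    """
    a, b = t_range
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread in the distances, map everything to a.
        return [float(a) for _ in values]
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


distances = [1.0, 2.0, 5.0]

# normalization=(0, 1): smallest distance -> 0, largest distance -> 1
as_distance = min_max_normalize(distances, (0, 1))

# normalization=(1, 0): smallest distance -> 1, largest distance -> 0,
# so the closest match receives the highest score
as_score = min_max_normalize(distances, (1, 0))

print(as_distance)
print(as_score)
```

With (1, 0) the ordering is reversed, so the values read as similarity scores rather than distances.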

Parameters
  • darray (Union[DocumentArray, DocumentArrayMemmap]) – the other DocumentArray or DocumentArrayMemmap to match against

  • metric (Union[str, Callable[[ForwardRef, ForwardRef], ForwardRef]]) – the distance metric, given either as a metric name or as a callable

  • limit (Union[int, float, None]) – the maximum number of matches; defaults to 20 when not given.

  • normalization (Optional[Tuple[float, float]]) – a tuple [a, b] to be used with min-max normalization: the minimum distance is rescaled to a and the maximum distance is rescaled to b, so that all values fall into the range [a, b].

  • metric_name (Optional[str]) – if provided, then match result will be marked with this string.

  • batch_size (Optional[int]) – if provided, then darray is loaded in chunks of at most batch_size elements. This option is slower but more memory efficient. It is especially recommended when darray is a large DocumentArrayMemmap.

  • traversal_ldarray (Optional[Sequence[str]]) – if set, then matching is applied along the traversal_path of the left-hand DocumentArray.

  • traversal_rdarray (Optional[Sequence[str]]) – if set, then matching is applied along the traversal_path of the right-hand DocumentArray.

  • use_scipy (bool) – if set, use scipy as the computation backend.

  • exclude_self (bool) – if set, Documents in darray with the same id as the left-hand Document will not be considered as matches.

  • is_sparse (bool) – if set, the embeddings of the left-hand and right-hand DocumentArray are treated as sparse NdArray.

Return type

None
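The interplay of metric='cosine', limit and exclude_self can be sketched with a brute-force search in plain Python. This is an illustrative sketch only, not the library's code: match_sketch and cosine_distance are hypothetical names, Documents are reduced to an id-to-embedding mapping, and the sketch returns its results, whereas match stores them in each Document's matches and returns None.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors: 1 - cos(angle)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


def match_sketch(left, right, limit=20, exclude_self=False):
    """Hypothetical stand-in for match: left and right map id -> embedding.

    Returns {left_id: [(right_id, distance), ...]} sorted by distance.
    """
    results = {}
    for lid, lvec in left.items():
        candidates = [
            (rid, cosine_distance(lvec, rvec))
            for rid, rvec in right.items()
            # exclude_self drops right-hand entries sharing the left-hand id
            if not (exclude_self and rid == lid)
        ]
        candidates.sort(key=lambda pair: pair[1])
        results[lid] = candidates[:limit]  # keep at most `limit` matches
    return results


left = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
right = {"a": [1.0, 0.0], "c": [0.0, 2.0], "d": [1.0, 1.0]}

matches = match_sketch(left, right, limit=2, exclude_self=True)
for doc_id, found in matches.items():
    print(doc_id, [rid for rid, _ in found])
```

Without exclude_self, the entry "a" on the right-hand side would be the closest match for "a" on the left, at distance 0.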

visualize(output=None, title=None, colored_tag=None, colormap='rainbow', method='pca', show_axis=False)[source]

Visualize embeddings in a 2D projection, using the PCA algorithm by default. This function requires matplotlib to be installed.

If colored_tag is provided, the plot uses a distinct color for each unique value of that tag in the Documents of the DocumentArray.

Parameters
  • output (Optional[str]) – Optional path to store the visualization. If not given, the plot is shown in the UI.

  • title (Optional[str]) – Optional title of the plot. When not given, the default title is used.

  • colored_tag (Optional[str]) – Optional str that specifies the tag used to color the plot.

  • colormap (str) – the colormap string supported by matplotlib.

  • method (str) – the visualization method; available options are pca and tsne. pca is fast but may not represent nonlinear relationships in high-dimensional data well. tsne requires scikit-learn to be installed and is much slower.

  • show_axis (bool) – If set, the axis and bounding box of the plot are shown.
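What a 2D PCA projection with per-tag coloring involves can be sketched with NumPy alone. This is a sketch under stated assumptions, not the code behind visualize: pca_project_2d is a hypothetical helper, the embeddings and tag values are synthetic, and the plotting step is left out.

```python
import numpy as np


def pca_project_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    x = np.asarray(embeddings, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 8))  # synthetic 8-dimensional embeddings
points = pca_project_2d(embeddings)    # shape (50, 2)

# One color index per unique tag value, analogous to what colored_tag selects.
tags = ["cat" if i % 2 == 0 else "dog" for i in range(50)]
tag_to_index = {tag: i for i, tag in enumerate(sorted(set(tags)))}
color_indices = [tag_to_index[tag] for tag in tags]

print(points.shape, tag_to_index)
```

The resulting points and color_indices could then be passed to a scatter plot, with the colormap turning each index into a color.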