jina.types.document package

Submodules

Module contents

class jina.types.document.Document(*, adjacency: Optional[int] = 'None', blob: Optional[Union[ArrayType, jina_pb2.NdArrayProto, NdArray]] = 'None', buffer: Optional[bytes] = 'None', chunks: Optional[Iterable[Document]] = 'None', content: Optional[jina.types.document.DocumentContentType] = 'None', embedding: Optional[Union[ArrayType, jina_pb2.NdArrayProto, NdArray]] = 'None', granularity: Optional[int] = 'None', id: Optional[str] = 'None', matches: Optional[Iterable[Document]] = 'None', mime_type: Optional[str] = 'None', modality: Optional[str] = 'None', parent_id: Optional[str] = 'None', tags: Optional[Union[Dict, jina.types.struct.StructView]] = 'None', text: Optional[str] = 'None', uri: Optional[str] = 'None', weight: Optional[float] = 'None', **kwargs)[source]

Bases: jina.types.mixin.ProtoTypeMixin, jina.types.document.helper.VersionedMixin

Document is one of the primitive data types in Jina.

It offers a Pythonic interface that lets users access and manipulate a jina.jina_pb2.DocumentProto object without working with Protobuf directly.

To create a Document object, simply:

from jina import Document
d = Document()
d.text = 'abc'

Jina requires each Document to have a string id. You can set a custom one; if none has been set, a random one is assigned.

To access and modify the content of the document, you can use text, blob, and buffer. Each property has a dedicated setter to preserve data integrity and improve the user experience. For example, doc.blob or doc.embedding can be assigned simply via:

import numpy as np

# to set as content
d.content = np.random.random([10, 5])

# to set as embedding
d.embedding = np.random.random([10, 5])

The MIME type is set or guessed automatically when content or uri is assigned.

Document also provides multiple ways to build from an existing Document. You can build a Document from jina_pb2.DocumentProto, bytes, str, and Dict. You can also use it as a view (i.e. a weak reference) when building from an existing jina_pb2.DocumentProto. For example,
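
As a rough illustration of what guessing a MIME type from a uri can look like, the sketch below uses Python's standard mimetypes module. This is an assumption made for illustration and not necessarily how Document performs the guess; guess_mime is a hypothetical helper name.

```python
import mimetypes
from typing import Optional


def guess_mime(uri: str) -> Optional[str]:
    """Guess a MIME type from the file extension of a URI (hypothetical helper)."""
    mime_type, _encoding = mimetypes.guess_type(uri)
    return mime_type  # None when no type can be inferred


print(guess_mime('notes.txt'))
print(guess_mime('photo.png'))
print(guess_mime('no_extension'))
```

A guess based on the extension can fail, so a caller may still want to set mime_type explicitly when it is known.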

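from jina import Document
from jina.proto.jina_pb2 import DocumentProto

a = DocumentProto()
b = Document(a, copy=False)  # b is a view of a, not a copy
a.text = 'hello'
assert b.text == 'hello'  # the change made through a is visible through b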
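
The difference between a view and a deep copy can be sketched without Jina at all. The classes below (Record, Wrapper) are hypothetical stand-ins, not part of the library: Record plays the role of a mutable message object, and Wrapper either keeps a reference to it or copies it.

```python
import copy


class Record:
    """Stand-in for a mutable message object such as a Protobuf message."""

    def __init__(self):
        self.text = ''


class Wrapper:
    """Hypothetical wrapper that holds either a reference (view) or a deep copy."""

    def __init__(self, source, copy_source=False):
        # With copy_source=False the wrapper shares the underlying object.
        self._source = copy.deepcopy(source) if copy_source else source

    @property
    def text(self):
        return self._source.text


a = Record()
view = Wrapper(a, copy_source=False)     # shares the underlying object
snapshot = Wrapper(a, copy_source=True)  # independent deep copy
a.text = 'hello'

print(view.text)      # reflects the later change to a
print(snapshot.text)  # still the value at copy time
```

A view is cheaper to build, but any later change to the source object shows through it; a copy is isolated from such changes.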

You can leverage the convert_a_to_b() interface to convert between content forms.

Parameters
  • document (Optional[~DocumentSourceType]) – the document to construct from. If bytes is given, a DocumentProto is deserialized from it; if a dict is given, a DocumentProto is parsed from it; if a str is given, it is treated as a JSON string and a DocumentProto is parsed from it; finally, a DocumentProto can be given directly, in which case a view or a copy is built from it depending on copy.

  • copy (bool) – when document is given as a DocumentProto object, whether to build a deep copy of it rather than a view (i.e. a weak reference).

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • kwargs – other parameters to be set after the document is constructed

Note

When document is a JSON string or Python dictionary object, the constructor maps only the values of known fields defined in Protobuf; all unknown fields are mapped to document.tags. For example,

d = Document({'id': '123', 'hello': 'world', 'tags': {'good': 'bye'}})

assert d.id == '123'  # true
assert d.tags['hello'] == 'world'  # true
assert d.tags['good'] == 'bye'  # true
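
The routing of unknown keys into tags, together with the renaming done by field_resolver, can be pictured with plain dictionaries. The sketch below is not the library's implementation: split_known_fields and KNOWN_FIELDS are hypothetical names, and the set of known fields is an illustrative subset rather than the real Protobuf schema.

```python
from typing import Any, Dict, Optional, Set

# Illustrative subset of field names; the real set is defined by the Protobuf schema.
KNOWN_FIELDS: Set[str] = {'id', 'text', 'uri', 'tags'}


def split_known_fields(
    raw: Dict[str, Any],
    field_resolver: Optional[Dict[str, str]] = None,
    known: Set[str] = KNOWN_FIELDS,
) -> Dict[str, Any]:
    """Keep known fields at the top level and move everything else into tags."""
    resolver = field_resolver or {}
    fields: Dict[str, Any] = {}
    tags: Dict[str, Any] = dict(raw.get('tags', {}))
    for key, value in raw.items():
        if key == 'tags':
            continue
        name = resolver.get(key, key)  # rename the input key if a mapping exists
        if name in known:
            fields[name] = value
        else:
            tags[name] = value  # unknown field: stored under tags
    fields['tags'] = tags
    return fields


doc = split_known_fields({'id': '123', 'hello': 'world', 'tags': {'good': 'bye'}})
renamed = split_known_fields({'doc_id': '7', 'text': 'abc'}, field_resolver={'doc_id': 'id'})
print(doc)
print(renamed)
```

How a clash between an unknown top-level key and an existing key in tags is resolved is not stated in the text above; in this sketch the top-level value wins.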

ON_GETATTR: List = ['matches', 'chunks']

pop(*fields)[source]

Remove the values from the given fields of this Document.

Parameters

fields – field names

Return type

None

clear()[source]

Remove all values from all fields of this Document.

Return type

None

property weight: float

Return type

float

Returns

the weight of the document

property modality: str

Return type

str

Returns

the modality of the document.

property tags: jina.types.struct.StructView

Return the tags field of this Document as a Python dict

Return type

StructView

Returns

a Python dict view of the tags.

update(source, fields=None)[source]

Updates fields specified in fields from the source to current Document.

Parameters
  • source (Document) – the Document to update from. The current Document is referred to as the destination.

  • fields (Optional[List[str]]) – a list of field names to update; if not specified, all fields present in source are used.

Note

  • If fields is empty, all fields present in source are merged into the current Document.

  • tags is updated like a Python dict.

  • The current Document is modified in place; source is left unchanged.

  • If the current Document has more fields than source, the extra fields are preserved.

Return type

None
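
The merge rules noted for update() can be mimicked on plain dictionaries. This is only a sketch under that simplification: update_fields is a hypothetical helper, and real Documents are Protobuf-backed objects rather than dicts.

```python
from typing import Any, Dict, List, Optional


def update_fields(
    destination: Dict[str, Any],
    source: Dict[str, Any],
    fields: Optional[List[str]] = None,
) -> None:
    """Merge selected fields of source into destination, in place."""
    names = fields if fields else list(source)  # empty or None: use all fields in source
    for name in names:
        if name not in source:
            continue
        if name == 'tags':
            # tags behave like dict.update: existing keys are kept, new ones are added
            destination.setdefault('tags', {}).update(source['tags'])
        else:
            destination[name] = source[name]


dst = {'id': '1', 'text': 'old', 'tags': {'a': 1}, 'weight': 0.5}
src = {'text': 'new', 'tags': {'b': 2}}
update_fields(dst, src)
print(dst)

only_tags = {'text': 'keep', 'tags': {}}
update_fields(only_tags, src, fields=['tags'])
print(only_tags)
```

Fields that exist only in the destination (id and weight here) are untouched, and src is never modified.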

property content_hash: str

Get the document hash according to its content.

Return type

str

Returns

the unique hash code to represent this Document
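
One way to picture a content-based hash is sketched below on plain dictionaries. It assumes that identity fields such as id and parent_id do not take part in the hash; that exclusion, the JSON serialization, and the choice of blake2b are assumptions made for this sketch, since the text above only says that the hash is derived from the content. hash_content is a hypothetical helper.

```python
import hashlib
import json
from typing import Any, Dict, Tuple


def hash_content(doc: Dict[str, Any], exclude: Tuple[str, ...] = ('id', 'parent_id')) -> str:
    """Hash every field except the excluded identity fields (hypothetical helper)."""
    content = {k: v for k, v in doc.items() if k not in exclude}
    # sort_keys makes the serialization independent of insertion order
    payload = json.dumps(content, sort_keys=True).encode('utf-8')
    return hashlib.blake2b(payload, digest_size=8).hexdigest()


h1 = hash_content({'id': 'a', 'text': 'hello'})
h2 = hash_content({'id': 'b', 'text': 'hello'})
h3 = hash_content({'id': 'a', 'text': 'world'})
print(h1, h2, h3)
```

Two documents with the same content but different ids get the same digest, which is what makes such a hash useful for detecting duplicates.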

property id: str

The document id in string.

Return type

str

Returns

the id of this Document

property parent_id: str

The document’s parent id in string.

Return type

str

Returns

the parent id of this Document

get_sparse_blob(sparse_ndarray_cls_type, **kwargs)[source]

Return the blob of the content of a Document as a sparse array.

Parameters
  • sparse_ndarray_cls_type (Type[BaseSparseNdArray]) – Sparse class type, such as SparseNdArray.

  • kwargs – additional keyword arguments. For the scipy backend, the keyword sp_format must be set to one of the sparse formats supported by scipy, such as coo or csr.

Return type

SparseArrayType

Returns

the blob of this Document as a sparse array

property blob: ArrayType

Return blob, one of the content forms of a Document.

Note

Use content to return the content of a Document

This property returns the blob of the Document as a dense or sparse array, depending on the actual proto instance stored. If the stored blob is sparse, it is returned as a coo matrix. If another sparse type is desired, use get_sparse_blob().

Return type

ArrayType

Returns

the blob content of this Document

get_sparse_embedding(sparse_ndarray_cls_type, **kwargs)[source]

Return the embedding of the content of a Document as a sparse array.

Parameters
  • sparse_ndarray_cls_type (Type[BaseSparseNdArray]) – Sparse class type, such as SparseNdArray.

  • kwargs – additional keyword arguments. For the scipy backend, the keyword sp_format must be set to one of the sparse formats supported by scipy, such as coo or csr.

Return type

SparseArrayType

Returns

the embedding of this Document as a sparse array

property embedding: SparseArrayType

Return the embedding of the content of a Document.

Note

This property returns the embedding of the Document as a dense or sparse array, depending on the actual proto instance stored. If the stored embedding is sparse, it is returned as a coo matrix. If another sparse type is desired, use get_sparse_embedding().

Return type

SparseArrayType

Returns

the embedding of this Document

property matches: MatchArray

Get all matches of the current document.

Return type

MatchArray

Returns

the array of matches attached to this document

property chunks: ChunkArray

Get all chunks of the current document.

Return type

ChunkArray

Returns

the array of chunks of this document

set_attributes(**kwargs)[source]

Bulk update Document fields with key-value specified in kwargs

See also

get_attributes() for bulk get attributes

Parameters

kwargs – the keyword arguments to set the values, where the keys are the fields to set

get_attributes(*fields)[source]

Bulk fetch Document fields and return a list of the values of these fields

Note

Arguments will be extracted using dunder_get. For example:

d = Document({'id': '123', 'hello': 'world', 'tags': {'id': 'external_id', 'good': 'bye'}})

assert d.id == '123'  # true
assert d.tags['hello'] == 'world'  # true
assert d.tags['good'] == 'bye'  # true
assert d.tags['id'] == 'external_id'  # true

res = d.get_attributes(*['id', 'tags__hello', 'tags__good', 'tags__id'])

assert res == ['123', 'world', 'bye', 'external_id']

Parameters

fields (str) – the variable length values to extract from the document

Return type

Union[Any, List[Any]]

Returns

a list with the attributes of this document ordered as the args
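
The double-underscore lookup used by get_attributes() can be approximated on nested dictionaries as follows. This is a sketch of the idea, not the library's dunder_get: the two functions below are hypothetical re-implementations that only handle dict-like values.

```python
from typing import Any, Dict, List


def dunder_get(obj: Dict[str, Any], key: str) -> Any:
    """Resolve 'a__b__c' as obj['a']['b']['c'] (hypothetical re-implementation)."""
    value: Any = obj
    for part in key.split('__'):
        value = value[part]
    return value


def get_values(obj: Dict[str, Any], *fields: str) -> List[Any]:
    """Return the values of the requested fields, in the order of the arguments."""
    return [dunder_get(obj, field) for field in fields]


doc = {'id': '123', 'tags': {'hello': 'world', 'good': 'bye', 'id': 'external_id'}}
res = get_values(doc, 'id', 'tags__hello', 'tags__good', 'tags__id')
print(res)
```

Because the path is spelled out, 'id' and 'tags__id' address two different values even though they share a name.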

property buffer: bytes

Return buffer, one of the content forms of a Document.

Note

Use content to return the content of a Document

Return type

bytes

Returns

the buffer bytes from this document

property text

Return text, one of the content forms of a Document.

Note

Use content to return the content of a Document

Returns

the text from this document content

property uri: str

Return the URI of the document.

Return type

str

Returns

the uri of this Document

property mime_type: str

Get MIME type of the document

Return type

str

Returns

the mime_type of this Document

property content_type: str

Return the content type of the document. Possible values: text, blob, buffer.

Return type

str

Returns

the type of content of this Document

property content: jina.types.document.DocumentContentType

Return the content of the document. It checks which of the fields blob, text, and buffer has a value and returns it.

See also

blob, buffer, text

Return type

~DocumentContentType

Returns

the value of the content, depending on content_type

property granularity

Return the granularity of the document.

Returns

the granularity of this Document

property adjacency

Return the adjacency of the document.

Returns

the adjacency of this Document

property scores

Return the scores of the document.

Returns

the scores attached to this document as a NamedScoreMapping

property evaluations

Return the evaluations of the document.

Returns

the evaluations attached to this document as a NamedScoreMapping

convert_image_buffer_to_blob(color_axis=-1)[source]

Convert an image buffer to blob

Parameters

color_axis (int) – the axis id of the color channel, -1 indicates the color channel info at the last axis

convert_image_blob_to_uri(width=None, height=None, resize_method='BILINEAR', color_axis=-1)[source]

Assuming blob is a valid image, set uri accordingly.

Parameters
  • width (Optional[int]) – the width of the blob; if None, it is inferred from the blob shape.

  • height (Optional[int]) – the height of the blob; if None, it is inferred from the blob shape.

  • resize_method (str) – the resize method name

  • color_axis (int) – the axis id of the color channel, -1 indicates the color channel info at the last axis

Note

If both width and height are provided, no resize is applied. Otherwise, the image size is taken from the shape of self.blob and the resize method resize_method is applied.

convert_image_uri_to_blob(color_axis=-1, uri_prefix=None)[source]

Convert uri to blob

Parameters
  • color_axis (int) – the axis id of the color channel, -1 indicates the color channel info at the last axis

  • uri_prefix (Optional[str]) – the prefix of the uri

convert_image_datauri_to_blob(color_axis=-1)[source]

Convert data URI to image blob

Parameters

color_axis (int) – the axis id of the color channel, -1 indicates the color channel info at the last axis

convert_buffer_to_blob(dtype=None, count=-1, offset=0)[source]

Assuming the buffer is a valid buffer of a Numpy ndarray, set blob accordingly.

Parameters
  • dtype – Data-type of the returned array; default: float.

  • count – Number of items to read. -1 means all data in the buffer.

  • offset – Start reading the buffer from this offset (in bytes); default: 0.

Note

Only the values, not the shape information, can be recovered from a pure buffer.
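
Why the shape cannot be recovered from a raw buffer is easy to see in a small round trip. The sketch below uses the standard library's array module as a stand-in for a Numpy ndarray, which is an assumption made to keep the example dependency-free; the count and offset variables mirror the parameters described above.

```python
from array import array

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # a 2 x 3 "matrix"
flat = array('d', [x for row in rows for x in row])
buffer = flat.tobytes()  # raw bytes only: no dtype and no shape are stored

# Reading the buffer back yields a flat sequence of values.
restored = array('d')
restored.frombytes(buffer)

# The shape has to be supplied separately by the caller.
n_cols = 3
reshaped = [list(restored[i:i + n_cols]) for i in range(0, len(restored), n_cols)]

# offset is given in bytes, count in items.
itemsize = restored.itemsize
offset, count = 2 * itemsize, 3
part = array('d')
part.frombytes(buffer[offset:offset + count * itemsize])

print(list(restored))
print(reshaped)
print(list(part))
```

The same bytes could equally be read as a 3 x 2 or a 6 x 1 array, so the shape, like the dtype, has to be known from elsewhere.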

convert_blob_to_buffer()[source]

Convert blob to buffer

convert_uri_to_buffer()[source]

Convert uri to buffer. Internally, it downloads the data from the URI and sets buffer.

convert_uri_to_datauri(charset='utf-8', base64=False)[source]

Convert uri to data URI. Internally, it reads uri into buffer and converts it to a data URI.

Parameters
  • charset (str) – charset may be any character set registered with IANA

  • base64 (bool) – used to encode arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.

convert_buffer_to_uri(charset='utf-8', base64=False)[source]

Convert buffer to data uri. Internally it first reads into buffer and then converts it to data URI.

Parameters
  • charset (str) – charset may be any character set registered with IANA

  • base64 (bool) – used to encode arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.

convert_text_to_uri(charset='utf-8', base64=False)[source]

Convert text to data uri.

Parameters
  • charset (str) – charset may be any character set registered with IANA

  • base64 (bool) – used to encode arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.
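
The effect of the charset and base64 parameters on a data URI can be sketched with the standard library. text_to_datauri below is a hypothetical helper written for illustration, not the method's implementation; it follows the general data URI layout data:[mediatype][;base64],data, and the default MIME type is an assumption.

```python
import base64
from urllib.parse import quote


def text_to_datauri(
    text: str,
    mime_type: str = 'text/plain',
    charset: str = 'utf-8',
    use_base64: bool = False,
) -> str:
    """Build a data URI from text (hypothetical helper)."""
    data = text.encode(charset)
    if use_base64:
        payload = base64.b64encode(data).decode('ascii')
        return f'data:{mime_type};charset={charset};base64,{payload}'
    # Without base64, bytes outside the URL-safe set are percent-encoded.
    return f'data:{mime_type};charset={charset},{quote(data)}'


plain = text_to_datauri('hello world')
encoded = text_to_datauri('hi', use_base64=True)
print(plain)
print(encoded)
```

Percent-encoding keeps short ASCII text readable, while base64 is usually the more compact choice for binary or heavily non-ASCII payloads.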

convert_uri_to_text()[source]

Assuming URI is text, convert it to text

convert_content_to_uri()[source]

Convert content to URI on a best-effort basis.

MergeFrom(doc)[source]

Merge the content of doc into this Document.

Parameters

doc (Document) – the document to merge from

CopyFrom(doc)[source]

Copy the content of doc into this Document.

Parameters

doc (Document) – the document to copy from

plot(output=None, inline_display=False)[source]

Visualize the Document recursively.

Parameters
  • output (Optional[str]) – the filename of the image to be created; the suffix (svg or jpg) determines the file type of the output image

  • inline_display (bool) – show image directly inside the Jupyter Notebook

Return type

None

dict(prettify_ndarrays=False, *args, **kwargs)[source]

Return the object as a Python dictionary

Parameters
  • prettify_ndarrays – boolean indicating if the ndarrays need to be prettified to be shown as lists of values

  • args – Extra positional arguments

  • kwargs – Extra keyword arguments

Returns

dict representation of the object

json(prettify_ndarrays=False, *args, **kwargs)[source]

Return the object as a JSON string

Parameters
  • prettify_ndarrays – boolean indicating if the ndarrays need to be prettified to be shown as lists of values

  • args – Extra positional arguments

  • kwargs – Extra keyword arguments

Returns

JSON string of the object

property non_empty_fields: Tuple[str]

Return the set fields of the current document that are not empty

Return type

Tuple[str]

Returns

the tuple of non-empty fields

static attributes(include_proto_fields=True, include_proto_fields_camelcase=False, include_properties=False)[source]

Return all attributes supported by the Document, which can be accessed by doc.attribute

Parameters
  • include_proto_fields (bool) – if set, then include all protobuf fields

  • include_proto_fields_camelcase (bool) – if set, then include all protobuf fields in CamelCase

  • include_properties (bool) – if set, then include all properties defined for Document class

Return type

List[str]

Returns

a list of attribute names as strings.