jina.types.document

class jina.types.document.Document(document=None, copy=False, **kwargs)[source]

Bases: object

Document is one of the primitive data types in Jina.

It offers a Pythonic interface that lets users access and manipulate a jina.jina_pb2.DocumentProto object without working with Protobuf itself.

To create a Document object, simply:

from jina import Document
d = Document()
d.text = 'abc'

Jina requires each Document to have a string id. You can set a custom one; if none has been set, a random one is assigned.

Or you can use Document as a context manager:

with Document() as d:
    d.text = 'hello'

assert d.id  # now `id` has value

To access and modify the content of the document, use text, blob, and buffer. Each property has its own setter, which helps keep the data consistent and the interface simple. For example, a blob or an embedding can be assigned as follows:

import numpy as np

# to set as content
d.content = np.random.random([10, 5])

# to set as embedding
d.embedding = np.random.random([10, 5])

The MIME type is set or guessed automatically when content or uri is assigned.
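To illustrate what guessing a MIME type from a URI can look like, the sketch below uses Python's standard mimetypes module. Whether Document relies on this module internally is an assumption, not something stated on this page, and guess_mime is a hypothetical helper.

```python
import mimetypes
from typing import Optional


def guess_mime(uri: str) -> Optional[str]:
    """Guess a MIME type from the file extension of `uri` (hypothetical helper)."""
    mime, _encoding = mimetypes.guess_type(uri)
    return mime


# A known extension yields a MIME type; an unknown one yields None.
print(guess_mime('photo.png'))
print(guess_mime('notes.txt'))
print(guess_mime('no_extension'))
```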

Document also provides multiple ways to build from an existing document. You can build a Document from a jina_pb2.DocumentProto, bytes, str, or Dict. You can also use it as a view (i.e. a weak reference) when building from an existing jina_pb2.DocumentProto. For example,

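# import path assumed; adjust it to your Jina version if it differs
from jina.proto.jina_pb2 import DocumentProto

a = DocumentProto()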
b = Document(a, copy=False)
a.text = 'hello'
assert b.text == 'hello'
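The difference between a view and a copy can be sketched without Jina or Protobuf. FakeProto and Wrapper below are hypothetical stand-ins for illustration only; they are not part of the Jina API, and copy_proto plays the role of the copy parameter.

```python
import copy


class FakeProto:
    """Hypothetical stand-in for a Protobuf message with one field."""

    def __init__(self):
        self.text = ''


class Wrapper:
    """Hypothetical wrapper that keeps either a reference (view) or a deep copy."""

    def __init__(self, proto, copy_proto=False):
        # A view shares the underlying object; a copy is detached from it.
        self._proto = copy.deepcopy(proto) if copy_proto else proto

    @property
    def text(self):
        return self._proto.text


source = FakeProto()
view = Wrapper(source, copy_proto=False)
snapshot = Wrapper(source, copy_proto=True)

source.text = 'hello'
assert view.text == 'hello'  # the view follows later changes to the source
assert snapshot.text == ''   # the copy was taken before the change
```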

You can leverage the convert_a_to_b() interface to convert between content forms.

Parameters
  • document (Optional[~DocumentSourceType]) – the document to construct from. If bytes is given, a DocumentProto is deserialized from it; if a dict is given, a DocumentProto is parsed from it; if a str is given, it is treated as a JSON string and a DocumentProto is parsed from it. A DocumentProto can also be given directly, in which case copy determines whether a view or a copy is built from it.

  • copy (bool) – when document is given as a DocumentProto object, whether to build a deep copy of it (True) or a view, i.e. a weak reference (False).

  • kwargs – other parameters to be set

property length
Return type

int

property weight

Returns the weight of the document

Return type

float

property modality

Get the modality of the document

Return type

str

property content_hash

update_content_hash(exclude_fields=('id', 'chunks', 'matches', 'content_hash'), include_fields=None)[source]

Update the document hash according to its content.

Parameters
  • exclude_fields (Optional[Tuple[str]]) – a tuple of field names to exclude when computing the content hash

  • include_fields (Optional[Tuple[str]]) – a tuple of field names to include when computing the content hash

Note

“exclude_fields” and “include_fields” are mutually exclusive; use only one of them.

Return type

None
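The effect of excluding or including fields can be sketched on a plain dict. The helper below is hypothetical and is not Jina's implementation: the choice of blake2b, the JSON serialization, and the rule that include_fields takes precedence when given are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Dict, Optional, Tuple


def content_hash(
    fields: Dict[str, Any],
    exclude_fields: Optional[Tuple[str, ...]] = ('id', 'chunks', 'matches', 'content_hash'),
    include_fields: Optional[Tuple[str, ...]] = None,
) -> str:
    """Hash selected fields of a dict (hypothetical helper, not Jina's code)."""
    if include_fields is not None:
        # Assumption: an explicit include list takes precedence.
        selected = {k: v for k, v in fields.items() if k in include_fields}
    else:
        excluded = exclude_fields or ()
        selected = {k: v for k, v in fields.items() if k not in excluded}
    # Sort keys so the digest does not depend on insertion order.
    payload = json.dumps(selected, sort_keys=True).encode('utf-8')
    return hashlib.blake2b(payload, digest_size=8).hexdigest()


a = {'id': '1', 'text': 'hello', 'content_hash': ''}
b = {'id': '2', 'text': 'hello', 'content_hash': ''}

# With the default exclusions, documents that differ only in id hash the same.
assert content_hash(a) == content_hash(b)
# Hashing only the id makes the two differ.
assert content_hash(a, include_fields=('id',)) != content_hash(b, include_fields=('id',))
```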

property id

The document id as a hex string, intended for non-binary environments such as HTTP, CLI, and HTML, and for human readability. It is used as the major view.

Return type

UniqueId

property parent_id

The document’s parent id as a hex string, intended for non-binary environments such as HTTP, CLI, and HTML, and for human readability. It is used as the major view.

Return type

UniqueId

property blob

Return blob, one of the content forms of a Document.

Note

Use content to return the content of a Document

Return type

ndarray

property embedding

Return embedding of the content of a Document.

Return type

ndarray

property matches

Get all matches of the current document

Return type

MatchSet

property chunks

Get all chunks of the current document

Return type

ChunkSet

set_attrs(**kwargs)[source]

Bulk update Document fields with key-value specified in kwargs

See also

get_attrs() for bulk get attributes

get_attrs(*args)[source]

Bulk fetch Document fields and return a dict of the key-value pairs

See also

set_attrs() for bulk set/update attributes

Return type

Dict[str, Any]
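The bulk set and bulk get pattern described for set_attrs() and get_attrs() can be sketched on an ordinary Python object. Record is a hypothetical class used only for illustration; it does not reproduce how Document stores its fields.

```python
from typing import Any, Dict


class Record:
    """Hypothetical object with a few plain attributes."""

    def __init__(self):
        self.text = ''
        self.weight = 1.0

    def set_attrs(self, **kwargs) -> None:
        # Assign each keyword argument to the attribute of the same name.
        for key, value in kwargs.items():
            setattr(self, key, value)

    def get_attrs(self, *args: str) -> Dict[str, Any]:
        # Collect the requested attributes into a dict.
        return {key: getattr(self, key) for key in args}


r = Record()
r.set_attrs(text='hello', weight=0.5)
attrs = r.get_attrs('text', 'weight')
assert attrs == {'text': 'hello', 'weight': 0.5}
```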

property as_pb_object
Return type

DocumentProto

property buffer

Return buffer, one of the content forms of a Document.

Note

Use content to return the content of a Document

Return type

bytes

property text

Return text, one of the content forms of a Document.

Note

Use content to return the content of a Document

property uri
Return type

str

property mime_type

Get MIME type of the document

Return type

str

property content_type

Return the content type of the document, possible values: text, blob, buffer

Return type

str

property content

Return the content of the document. It checks which field among blob, text, and buffer has a value and returns it.

See also

blob, buffer, text

Return type

~DocumentContentType
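The relationship between content, content_type, and the three content fields can be sketched as follows. Holder is a hypothetical class, and the order in which the fields are checked is an assumption; it only matters if more than one field is set.

```python
from typing import Any, Optional


class Holder:
    """Hypothetical container with three alternative content fields."""

    def __init__(self, text: str = '', buffer: bytes = b'', blob: Optional[list] = None):
        self.text = text
        self.buffer = buffer
        self.blob = blob

    @property
    def content_type(self) -> Optional[str]:
        # Report the name of the first field that holds a value.
        for name in ('text', 'blob', 'buffer'):
            if getattr(self, name):
                return name
        return None

    @property
    def content(self) -> Any:
        # Return the value of whichever field is set, or None if all are empty.
        name = self.content_type
        return getattr(self, name) if name else None


assert Holder(text='hi').content == 'hi'
assert Holder(buffer=b'\x00\x01').content_type == 'buffer'
assert Holder().content is None
```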

property granularity

property score

convert_buffer_to_blob(**kwargs)[source]

Assuming buffer is a valid buffer of a NumPy ndarray, set blob accordingly.

Parameters

kwargs – reserved for maximum compatibility when using with ConvertDriver

Note

Only the values can be recovered from a pure buffer, not the shape information.
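Why shape information is lost can be shown with the standard library alone. The sketch below uses the array module as a stand-in for an ndarray; NumPy's frombuffer likewise returns a one-dimensional array. The variable names and the 2 x 3 matrix are illustrative assumptions.

```python
from array import array

rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # a 2 x 3 matrix

# Serializing keeps only the raw values, in row-major order.
flat = array('d', [value for row in rows for value in row])
buffer = flat.tobytes()

# Deserializing yields a flat sequence; nothing in the bytes says "2 x 3".
recovered = array('d')
recovered.frombytes(buffer)
assert list(recovered) == [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# The shape has to be supplied separately to rebuild the rows.
n_cols = 3
reshaped = [list(recovered[i:i + n_cols]) for i in range(0, len(recovered), n_cols)]
assert reshaped == rows
```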

convert_blob_to_uri(width, height, resize_method='BILINEAR', **kwargs)[source]

Assuming blob is a valid image, set uri accordingly.

convert_uri_to_buffer(**kwargs)[source]

Convert uri to buffer. Internally it downloads from the URI and sets buffer.

Parameters

kwargs – reserved for maximum compatibility when using with ConvertDriver

convert_uri_to_data_uri(charset='utf-8', base64=False, **kwargs)[source]

Convert uri to a data URI. Internally it reads uri into buffer and converts it to a data URI.

Parameters
  • charset (str) – charset may be any character set registered with IANA

  • base64 (bool) – whether to use base64 encoding, which encodes arbitrary octet sequences into a form that satisfies the rules of 7bit. It is designed to be efficient for non-text 8-bit and binary data, and is sometimes used for text data that frequently uses non-US-ASCII characters.

  • kwargs – reserved for maximum compatibility when using with ConvertDriver

convert_buffer_to_uri(charset='utf-8', base64=False, **kwargs)[source]

Convert buffer to a data URI. Internally it first reads into buffer and then converts it to a data URI.

Parameters
  • charset (str) – charset may be any character set registered with IANA

  • base64 (bool) – whether to use base64 encoding, which encodes arbitrary octet sequences into a form that satisfies the rules of 7bit. It is designed to be efficient for non-text 8-bit and binary data, and is sometimes used for text data that frequently uses non-US-ASCII characters.

  • kwargs – reserved for maximum compatibility when using with ConvertDriver

convert_text_to_uri(charset='utf-8', base64=False, **kwargs)[source]

Convert text to a data URI.

Parameters
  • charset (str) – charset may be any character set registered with IANA

  • base64 (bool) – whether to use base64 encoding, which encodes arbitrary octet sequences into a form that satisfies the rules of 7bit. It is designed to be efficient for non-text 8-bit and binary data, and is sometimes used for text data that frequently uses non-US-ASCII characters.

  • kwargs – reserved for maximum compatibility when using with ConvertDriver
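How charset and base64 shape a data URI can be sketched with the standard library. text_to_data_uri is a hypothetical helper that follows the general data URI layout (media type, optional charset, optional base64 marker, then the payload); its name, the mime_type default, and the use_base64 parameter are assumptions and it is not the code behind convert_text_to_uri.

```python
import base64
from urllib.parse import quote


def text_to_data_uri(text, mime_type='text/plain', charset='utf-8', use_base64=False):
    """Build a data URI from text (hypothetical helper)."""
    raw = text.encode(charset)
    if use_base64:
        # base64 keeps the payload within 7-bit ASCII.
        payload = base64.b64encode(raw).decode('ascii')
        return f'data:{mime_type};charset={charset};base64,{payload}'
    # Without base64, percent-encode bytes that are not URI-safe.
    return f'data:{mime_type};charset={charset},{quote(raw)}'


plain = text_to_data_uri('hello world')
encoded = text_to_data_uri('hello world', use_base64=True)
assert plain == 'data:text/plain;charset=utf-8,hello%20world'
assert encoded == 'data:text/plain;charset=utf-8;base64,aGVsbG8gd29ybGQ='
```

Percent-encoding stays readable for mostly ASCII text, while base64 tends to be the more compact choice for binary data or text dominated by non-ASCII characters.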

convert_uri_to_text(**kwargs)[source]

Assuming uri points to text, convert it to text.

Parameters

kwargs – reserved for maximum compatibility when using with ConvertDriver

convert_content_to_uri(**kwargs)[source]

Convert content to a URI on a best-effort basis.

Parameters

kwargs – reserved for maximum compatibility when using with ConvertDriver

MergeFrom(doc)[source]

CopyFrom(doc)[source]

traverse(traversal_path, callback_fn, *args, **kwargs)[source]

Traverse leaves of the document.

Return type

None