jina.types.document
-
class jina.types.document.Document(document=None, field_resolver=None, copy=False, **kwargs)
Bases: jina.types.mixin.ProtoTypeMixin, jina.types.document.traversable.Traversable
Document is one of the primitive data types in Jina. It offers a Pythonic interface that lets users access and manipulate a jina.jina_pb2.DocumentProto object without working with Protobuf itself.
To create a Document object, simply:
    from jina import Document
    d = Document()
    d.text = 'abc'
Jina requires each Document to have a string id. You can set a custom one; if none has been set, a random one will be assigned.
Alternatively, you can use Document as a context manager:
    with Document() as d:
        d.text = 'hello'
    assert d.id  # now `id` has value
To access and modify the content of the document, you can use text, blob, and buffer. Each property is implemented with a proper setter to improve integrity and user experience. For example, doc.blob or doc.embedding can be assigned simply via:
    import numpy as np
    # to set as content
    d.content = np.random.random([10, 5])
    # to set as embedding
    d.embedding = np.random.random([10, 5])
The MIME type is set or guessed automatically when setting content and uri.
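As a rough illustration of what guessing a MIME type from a URI can look like, the sketch below uses Python's standard mimetypes module. This is an assumption for illustration only, not necessarily how Document fills mime_type internally; guess_mime and its default value are hypothetical.

```python
import mimetypes


def guess_mime(uri: str, default: str = 'application/octet-stream') -> str:
    """Guess a MIME type from the extension of a URI or file path.

    Hypothetical helper for illustration only; it is not part of Jina.
    """
    mime, _encoding = mimetypes.guess_type(uri)
    # guess_type returns None when the extension is missing or unknown
    return mime or default


print(guess_mime('photo.png'))
print(guess_mime('notes.txt'))
print(guess_mime('no_extension'))
```

Guessing by extension is cheap but only a heuristic, which is why a fallback value is useful when nothing can be inferred.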
Document also provides multiple ways to build from an existing Document. You can build a Document from jina_pb2.DocumentProto, bytes, str, and Dict. You can also use it as a view (i.e. a weak reference) when building from an existing jina_pb2.DocumentProto. For example,
    a = DocumentProto()
    b = Document(a, copy=False)
    a.text = 'hello'
    assert b.text == 'hello'
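The difference between a view and a copy can be sketched without Protobuf at all. In the minimal example below, Proto and Wrapper are hypothetical stand-ins (not Jina classes) and copy_source plays the role of the copy argument; it only shows why a view sees later changes to its source while a deep copy does not.

```python
import copy


class Proto:
    """Stand-in for a Protobuf message; hypothetical, for illustration only."""

    def __init__(self, text: str = ''):
        self.text = text


class Wrapper:
    """Minimal wrapper holding either a reference to a Proto or a deep copy of it."""

    def __init__(self, proto: Proto, copy_source: bool = False):
        # copy_source=False keeps a reference, so later changes to `proto` stay visible
        self._proto = copy.deepcopy(proto) if copy_source else proto

    @property
    def text(self) -> str:
        return self._proto.text


a = Proto()
view = Wrapper(a, copy_source=False)
snapshot = Wrapper(a, copy_source=True)
a.text = 'hello'
assert view.text == 'hello'  # the view follows the source
assert snapshot.text == ''   # the copy was taken before the change
```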
You can leverage the convert_a_to_b() interface to convert between content forms.
- Parameters
document (Optional[~DocumentSourceType]) – the document to construct from. If bytes is given, a DocumentProto is deserialized from it; if dict is given, a DocumentProto is parsed from it; if str is given, it is treated as a JSON string and a DocumentProto is parsed from it; finally, a DocumentProto can be given directly, in which case a view or a copy is built from it depending on copy.
copy (bool) – when document is given as a DocumentProto object, build a view (i.e. weak reference) from it or a deep copy from it.
field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.
kwargs – other parameters to be set after the document is constructed
Note
When document is a JSON string or Python dictionary object, the constructor maps only the values of known fields defined in Protobuf; all unknown fields are mapped to document.tags. For example,
    d = Document({'id': '123', 'hello': 'world', 'tags': {'good': 'bye'}})
    assert d.id == '123'  # true
    assert d.tags['hello'] == 'world'  # true
    assert d.tags['good'] == 'bye'  # true
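The mapping rule in the note, together with field_resolver, can be sketched on plain dictionaries. Everything below is a hypothetical illustration under stated assumptions: KNOWN_FIELDS is a made-up subset of field names, and split_fields is not a Jina function.

```python
from typing import Any, Dict, Optional

# Hypothetical subset of known field names, for illustration only.
KNOWN_FIELDS = {'id', 'text', 'uri'}


def split_fields(
    source: Dict[str, Any],
    field_resolver: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Rename keys via field_resolver, then move unknown keys under 'tags'."""
    resolver = field_resolver or {}
    result: Dict[str, Any] = {'tags': dict(source.get('tags', {}))}
    for key, value in source.items():
        if key == 'tags':
            continue  # already copied above
        name = resolver.get(key, key)
        if name in KNOWN_FIELDS:
            result[name] = value
        else:
            result['tags'][name] = value
    return result


doc = split_fields(
    {'doc_id': '123', 'hello': 'world', 'tags': {'good': 'bye'}},
    field_resolver={'doc_id': 'id'},
)
assert doc == {'id': '123', 'tags': {'good': 'bye', 'hello': 'world'}}
```

Here doc_id is renamed to the known field id, while hello has no known counterpart and therefore ends up next to the existing tags.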
-
property
siblings
The number of siblings of the Document
- Getter
number of siblings
- Setter
number of siblings
- Type
int
- Return type
int
-
property
weight
- Return type
float
- Returns
the weight of the document
-
property
modality
- Return type
str
- Returns
the modality of the document.
-
property
content_hash
Get the content hash of the document.
- Returns
the content_hash from the proto
-
update
(source, exclude_fields=None, include_fields=None)
Updates the fields specified in include_fields from the source to the current Document.
- Parameters
exclude_fields (Optional[Tuple[str, …]]) – a tuple of field names to exclude from the current document; when not given, the non-empty fields of the current document are considered as exclude_fields
include_fields (Optional[Tuple[str, …]]) – a tuple of field names to include from the source document
Note
The destination (the current Document) will be modified in place; source will be unchanged.
- Return type
None
-
update_content_hash
(exclude_fields=('id', 'chunks', 'matches', 'content_hash', 'parent_id'), include_fields=None)
Update the document hash according to its content.
- Parameters
exclude_fields (Optional[Tuple[str]]) – a tuple of field names to exclude when computing the content hash
include_fields (Optional[Tuple[str]]) – a tuple of field names to include when computing the content hash
Note
exclude_fields and include_fields are mutually exclusive; use only one.
- Return type
None
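To make the effect of exclude_fields and include_fields concrete, here is a minimal sketch that hashes a plain dictionary of fields. It is not Jina's implementation: compute_content_hash is a hypothetical helper, SHA-256 over sorted JSON is a stand-in for whatever hash the library actually uses, and letting include_fields take precedence is an assumption.

```python
import hashlib
import json
from typing import Any, Dict, Optional, Tuple

# Default taken from the signature of update_content_hash.
DEFAULT_EXCLUDE = ('id', 'chunks', 'matches', 'content_hash', 'parent_id')


def compute_content_hash(
    fields: Dict[str, Any],
    exclude_fields: Optional[Tuple[str, ...]] = DEFAULT_EXCLUDE,
    include_fields: Optional[Tuple[str, ...]] = None,
) -> str:
    """Hash a selected subset of fields (hypothetical helper, not Jina's code)."""
    if include_fields is not None:
        # Assumption: an explicit include list takes precedence over the exclude list.
        selected = {k: v for k, v in fields.items() if k in include_fields}
    else:
        excluded = exclude_fields or ()
        selected = {k: v for k, v in fields.items() if k not in excluded}
    # Serialize with sorted keys so the hash does not depend on key order.
    payload = json.dumps(selected, sort_keys=True).encode('utf-8')
    return hashlib.sha256(payload).hexdigest()


a = {'id': 'aaa', 'text': 'hello', 'uri': 'a.txt'}
b = {'id': 'bbb', 'text': 'hello', 'uri': 'a.txt'}
c = {'id': 'ccc', 'text': 'hello', 'uri': 'c.txt'}

assert compute_content_hash(a) == compute_content_hash(b)  # ids are ignored
assert compute_content_hash(a) != compute_content_hash(c)  # uri differs
```

Excluding id by default means two documents with identical content but different ids hash to the same value, which is what makes such a hash usable for detecting duplicate content.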
-
property
id
The document id as a hex string, for non-binary environments such as HTTP, CLI, and HTML. It is also human-readable and will be used as the major view.
- Return type
str
- Returns
the id from the proto
-
property
parent_id
The document's parent id as a hex string, for non-binary environments such as HTTP, CLI, and HTML. It is also human-readable and will be used as the major view.
- Return type
str
- Returns
the parent id from the proto
-
property
blob
Return blob, one of the content forms of a Document.
Note
Use content to return the content of a Document
- Return type
ndarray
- Returns
the blob content from the proto
-
property
embedding
Return embedding of the content of a Document.
- Return type
ndarray
- Returns
the embedding from the proto
-
property
matches
Get all matches of the current document.
- Return type
- Returns
the set of matches attached to this document
-
property
chunks
Get all chunks of the current document.
- Return type
- Returns
the set of chunks of this document
-
set_attrs
(**kwargs)
Bulk update Document fields with the key-value pairs specified in kwargs
See also
get_attrs() for bulk get attributes
- Parameters
kwargs – the keyword arguments to set the values, where the keys are the fields to set
-
get_attrs
(*args)
Bulk fetch Document fields and return a dict of the key-value pairs
See also
update() for bulk set/update attributes
Note
Arguments will be extracted using dunder_get
    d = Document({'id': '123', 'hello': 'world', 'tags': {'id': 'external_id', 'good': 'bye'}})
    assert d.id == '123'  # true
    assert d.tags['hello'] == 'world'  # true
    assert d.tags['good'] == 'bye'  # true
    assert d.tags['id'] == 'external_id'  # true
    res = d.get_attrs(*['id', 'tags__hello', 'tags__good', 'tags__id'])
    assert res['id'] == '123'  # true
    assert res['tags__hello'] == 'world'  # true
    assert res['tags__good'] == 'bye'  # true
    assert res['tags__id'] == 'external_id'  # true
- Parameters
args – the variable length values to extract from the document
- Return type
Dict[str, Any]
- Returns
a dictionary mapping the fields in args to the actual attributes of this document
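The double-underscore keys such as tags__hello describe a path into nested fields. A minimal sketch of that kind of lookup on plain dictionaries and objects follows; dunder_lookup and bulk_get are hypothetical names for illustration and are not the dunder_get used by Jina.

```python
from typing import Any, Dict


def dunder_lookup(obj: Any, key: str) -> Any:
    """Resolve 'a__b__c' by walking mapping keys or attributes one level at a time."""
    current = obj
    for part in key.split('__'):
        if isinstance(current, dict):
            current = current[part]
        else:
            current = getattr(current, part)
    return current


def bulk_get(obj: Any, *keys: str) -> Dict[str, Any]:
    """Return a dict from each requested key to its resolved value."""
    return {key: dunder_lookup(obj, key) for key in keys}


# A plain dict stands in for a document here.
doc = {'id': '123', 'tags': {'hello': 'world', 'good': 'bye', 'id': 'external_id'}}
res = bulk_get(doc, 'id', 'tags__hello', 'tags__id')
assert res == {'id': '123', 'tags__hello': 'world', 'tags__id': 'external_id'}
```

Note that the result keeps the full dunder key, which is how the top-level id and the nested tags__id remain distinct.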
-
get_attrs_values
(*args)
Bulk fetch Document fields and return a list of the values of these fields
Note
Arguments will be extracted using dunder_get
    d = Document({'id': '123', 'hello': 'world', 'tags': {'id': 'external_id', 'good': 'bye'}})
    assert d.id == '123'  # true
    assert d.tags['hello'] == 'world'  # true
    assert d.tags['good'] == 'bye'  # true
    assert d.tags['id'] == 'external_id'  # true
    res = d.get_attrs_values(*['id', 'tags__hello', 'tags__good', 'tags__id'])
    assert res == ['123', 'world', 'bye', 'external_id']
- Parameters
args – the variable length values to extract from the document
- Return type
List[Any]
- Returns
a list with the attributes of this document, ordered as the args
-
property
buffer
Return buffer, one of the content forms of a Document.
Note
Use content to return the content of a Document
- Return type
bytes
- Returns
the buffer bytes from this document
-
property
text
Return text, one of the content forms of a Document.
Note
Use content to return the content of a Document
- Returns
the text from this document content
-
property
uri
Return the URI of the document.
- Return type
str
- Returns
the uri from this document proto
-
property
mime_type
Get the MIME type of the document
- Return type
str
- Returns
the mime_type from this document proto
-
property
content_type
Return the content type of the document, possible values: text, blob, buffer
- Return type
str
- Returns
the type of content present in this document proto
-
property
content
Return the content of the document. It checks which field among blob, text, and buffer has a value and returns it.
- Return type
~DocumentContentType
- Returns
the value of the content depending on content_type
-
property
granularity
Return the granularity of the document.
- Returns
the granularity from this document proto
-
property
adjacency
Return the adjacency of the document.
- Returns
the adjacency from this document proto
-
property
score
Return the score of the document.
- Returns
the score attached to this document as NamedScore
-
convert_buffer_to_blob
(**kwargs)
Assuming the buffer is a valid buffer of a Numpy ndarray, set blob accordingly.
- Parameters
kwargs – reserved for maximum compatibility when using with ConvertDriver
Note
Only the values, not the shape information, can be recovered from a pure buffer.
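The loss of shape information can be shown with the standard library alone. The sketch below uses Python's array module as a stand-in for a NumPy ndarray (an assumption made to keep the example dependency-free): the raw bytes decode back into a flat sequence of values, and the 2 x 3 layout has to be supplied separately.

```python
from array import array

# A 2 x 3 matrix flattened row by row into raw bytes (float64 values).
rows, cols = 2, 3
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
buffer = array('d', values).tobytes()

# Decoding the bytes yields the values, but only as a flat sequence.
flat = array('d')
flat.frombytes(buffer)
assert list(flat) == values

# The shape has to be carried separately and applied afterwards.
matrix = [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]
assert matrix == [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
```

The same six values could equally be a 3 x 2 or a 1 x 6 layout; nothing in the bytes distinguishes them.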
-
convert_buffer_image_to_blob
(color_axis=-1, **kwargs)
Convert an image buffer to blob
- Parameters
color_axis (int) – the axis id of the color channel; -1 indicates the color channel info is at the last axis
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_blob_to_uri
(width, height, resize_method='BILINEAR', **kwargs)
Assuming blob is a valid image, set uri accordingly
- Parameters
width (int) – the width of the blob
height (int) – the height of the blob
resize_method (str) – the resize method name
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_uri_to_blob
(color_axis=-1, uri_prefix=None, **kwargs)
Convert uri to blob
- Parameters
color_axis (int) – the axis id of the color channel; -1 indicates the color channel info is at the last axis
uri_prefix (Optional[str]) – the prefix of the uri
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_data_uri_to_blob
(color_axis=-1, **kwargs)
Convert data URI to image blob
- Parameters
color_axis (int) – the axis id of the color channel; -1 indicates the color channel info is at the last axis
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_uri_to_buffer
(**kwargs)
Convert uri to buffer. Internally it downloads from the URI and sets buffer.
- Parameters
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_uri_to_data_uri
(charset='utf-8', base64=False, **kwargs)
Convert uri to data uri. Internally it reads uri into buffer and converts it to a data uri
- Parameters
charset (str) – charset may be any character set registered with IANA
base64 (bool) – used to encode arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_buffer_to_uri
(charset='utf-8', base64=False, **kwargs)
Convert buffer to data uri. Internally it first reads into buffer and then converts it to data URI.
- Parameters
charset (str) – charset may be any character set registered with IANA
base64 (bool) – used to encode arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_text_to_uri
(charset='utf-8', base64=False, **kwargs)
Convert text to data uri.
- Parameters
charset (str) – charset may be any character set registered with IANA
base64 (bool) – used to encode arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.
kwargs – reserved for maximum compatibility when using with ConvertDriver
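The roles of charset and base64 are easiest to see in a round trip. The following is a minimal sketch of building and parsing a data URI with the standard library; text_to_data_uri and data_uri_to_text are hypothetical helpers, the use_base64 and mime_type parameters are assumptions, and the exact string Jina produces may differ.

```python
import base64
from urllib.parse import quote, unquote_to_bytes


def text_to_data_uri(text: str, charset: str = 'utf-8', use_base64: bool = False,
                     mime_type: str = 'text/plain') -> str:
    """Encode text as a data URI, either percent-encoded or base64-encoded."""
    raw = text.encode(charset)
    if use_base64:
        payload = base64.b64encode(raw).decode('ascii')
        return f'data:{mime_type};charset={charset};base64,{payload}'
    return f'data:{mime_type};charset={charset},{quote(raw)}'


def data_uri_to_text(uri: str) -> str:
    """Decode a text data URI produced by text_to_data_uri."""
    if not uri.startswith('data:'):
        raise ValueError('not a data URI')
    header, _, payload = uri.partition(',')
    meta = header[len('data:'):].split(';')
    charset = 'ascii'  # data URIs default to US-ASCII when no charset is given
    for item in meta:
        if item.startswith('charset='):
            charset = item[len('charset='):]
    raw = base64.b64decode(payload) if 'base64' in meta else unquote_to_bytes(payload)
    return raw.decode(charset)


plain = text_to_data_uri('hello world')
encoded = text_to_data_uri('hello world', use_base64=True)
assert plain == 'data:text/plain;charset=utf-8,hello%20world'
assert encoded == 'data:text/plain;charset=utf-8;base64,aGVsbG8gd29ybGQ='
assert data_uri_to_text(plain) == data_uri_to_text(encoded) == 'hello world'
```

Percent-encoding stays readable for mostly-ASCII text, while base64 is the more compact choice once the payload contains many non-ASCII or binary bytes.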
-
convert_uri_to_text
(**kwargs)
Assuming URI is text, convert it to text
- Parameters
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_content_to_uri
(**kwargs)
Convert content to URI with best effort
- Parameters
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
MergeFrom
(doc)
Merge the content of the target doc into the current document.
- Parameters
doc (Document) – the document to merge from
-
CopyFrom
(doc)
Copy the content of the target doc into the current document.
- Parameters
doc (Document) – the document to copy from
-
plot
(output=None, inline_display=False)
Visualize the Document recursively.
- Parameters
output (Optional[str]) – a filename specifying the name of the image to be created; the suffix svg/jpg determines the file type of the output image
inline_display (bool) – show image directly inside the Jupyter Notebook
- Return type
None
-
property
non_empty_fields
Return the fields of the current document that are set and not empty
- Return type
Tuple[str]
- Returns
the tuple of non-empty fields