jina.types.document
-
class jina.types.document.Document(document=None, copy=False, **kwargs)
Bases: object
Document is one of the primitive data types in Jina. It offers a Pythonic interface that lets users access and manipulate a jina.jina_pb2.DocumentProto object without working with Protobuf itself.
To create a Document object, simply:
    from jina import Document

    d = Document()
    d.text = 'abc'
Jina requires each Document to have a string id. You can set a custom one; if none has been set, a random one will be assigned.
Or you can use Document as a context manager:
    with Document() as d:
        d.text = 'hello'
    assert d.id  # now `id` has value
To access and modify the content of the document, you can use text, blob, and buffer. Each property is implemented with a proper setter to improve integrity and user experience. For example, assigning doc.blob or doc.embedding can be done simply via:
    import numpy as np

    # to set as content
    d.content = np.random.random([10, 5])

    # to set as embedding
    d.embedding = np.random.random([10, 5])
The MIME type is set or guessed automatically when setting content and uri.
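How the guess is made is not described here. As a minimal sketch of extension-based MIME guessing, assuming the standard library's `mimetypes` module is an acceptable stand-in, the helper `guess_mime` and its fallback value below are hypothetical and not part of Jina:

```python
import mimetypes


def guess_mime(uri: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the extension of a URI (hypothetical helper)."""
    mime, _encoding = mimetypes.guess_type(uri)
    # guess_type returns None for the type when the extension is unknown.
    return mime or default


png_mime = guess_mime("photos/cat.png")
unknown_mime = guess_mime("data/blob.unknownext")
```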
Document also provides multiple ways to build from an existing Document. You can build a Document from jina_pb2.DocumentProto, bytes, str, and Dict. You can also use it as a view (i.e. a weak reference) when building from an existing jina_pb2.DocumentProto. For example:
    a = DocumentProto()
    b = Document(a, copy=False)
    a.text = 'hello'
    assert b.text == 'hello'
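The difference between a view and a copy can be illustrated without Protobuf. The sketch below is not Jina's implementation: `FakeProto` and `Wrapper` are hypothetical stand-ins, and the "view" here is a plain shared reference rather than a true weak reference.

```python
import copy as copy_module
from dataclasses import dataclass


@dataclass
class FakeProto:
    """Hypothetical stand-in for a Protobuf message with one field."""
    text: str = ""


class Wrapper:
    """Hypothetical wrapper: shares the source object or deep-copies it."""

    def __init__(self, source: FakeProto, copy: bool = False):
        # copy=False keeps a reference, so later changes to `source` are visible.
        # copy=True takes an independent snapshot of the current state.
        self._pb = copy_module.deepcopy(source) if copy else source

    @property
    def text(self) -> str:
        return self._pb.text


source = FakeProto()
view = Wrapper(source, copy=False)
snapshot = Wrapper(source, copy=True)

source.text = "hello"  # mutate the original after both wrappers exist
```

After the mutation, `view.text` reflects the new value while `snapshot.text` still holds the empty string it was copied with.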
You can leverage the convert_a_to_b() interface to convert between content forms.
- Parameters
document (Optional[~DocumentSourceType]) – the document to construct from. If bytes is given, deserialize a DocumentProto from it; if dict is given, parse a DocumentProto from it; if str is given, treat it as a JSON string and parse a DocumentProto from it; finally, a DocumentProto can be given directly, in which case copy decides whether a view or a copy is built from it.
copy (bool) – when document is given as a DocumentProto object, whether to build a deep copy from it or a view (i.e. weak reference) on it.
kwargs – other parameters to be set
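The type-based dispatch described for the document parameter can be sketched as follows. This is an illustration under stated assumptions, not the constructor's actual code: `to_fields` is a hypothetical helper, it returns a plain dict instead of a DocumentProto, and bytes are treated as UTF-8 JSON here, whereas the real class deserializes Protobuf bytes.

```python
import json
from typing import Any, Dict, Union


def to_fields(source: Union[bytes, str, Dict[str, Any]]) -> Dict[str, Any]:
    """Normalise a source into a plain dict of fields (hypothetical helper)."""
    if isinstance(source, dict):
        # A dict is taken as field name -> value; copy it to avoid aliasing.
        return dict(source)
    if isinstance(source, str):
        # A str is treated as a JSON string.
        return json.loads(source)
    if isinstance(source, bytes):
        # ASSUMPTION: UTF-8 encoded JSON stands in for serialized Protobuf.
        return json.loads(source.decode("utf-8"))
    raise TypeError(f"unsupported source type: {type(source).__name__}")


from_dict = to_fields({"text": "abc"})
from_json = to_fields('{"text": "abc"}')
from_bytes = to_fields(b'{"text": "abc"}')
```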
-
property length
- Return type
int
-
property weight
Returns the weight of the document.
- Return type
float
-
property modality
Get the modality of the document.
- Return type
str
-
property content_hash
-
update_content_hash(exclude_fields=('id', 'chunks', 'matches', 'content_hash', 'parent_id'), include_fields=None)
Update the document hash according to its content.
- Parameters
exclude_fields (Optional[Tuple[str]]) – a tuple of field names that are excluded when computing the content hash
include_fields (Optional[Tuple[str]]) – a tuple of field names that are included when computing the content hash
Note
exclude_fields and include_fields are mutually exclusive; use only one.
- Return type
None
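The effect of the two field filters can be sketched with the standard library. Everything below is an assumption for illustration: the hash function (SHA-256 over sorted JSON), the helper name `content_hash_of`, and the choice to let include_fields take precedence when it is given are not taken from Jina's source.

```python
import hashlib
import json
from typing import Any, Dict, Optional, Tuple

DEFAULT_EXCLUDE = ("id", "chunks", "matches", "content_hash", "parent_id")


def content_hash_of(
    fields: Dict[str, Any],
    exclude_fields: Optional[Tuple[str, ...]] = DEFAULT_EXCLUDE,
    include_fields: Optional[Tuple[str, ...]] = None,
) -> str:
    """Hash only the selected fields of a document-like dict (hypothetical)."""
    if include_fields is not None:
        # ASSUMPTION: an explicit include list wins over the exclude list.
        selected = {k: v for k, v in fields.items() if k in include_fields}
    else:
        excluded = exclude_fields or ()
        selected = {k: v for k, v in fields.items() if k not in excluded}
    # Sorting keys makes the digest independent of dict insertion order.
    payload = json.dumps(selected, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


doc_a = {"id": "1", "text": "hello", "parent_id": "p"}
doc_b = {"id": "2", "text": "hello", "parent_id": "q"}
```

With the default exclusions, `doc_a` and `doc_b` hash to the same value because they differ only in id and parent_id; hashing with `include_fields=("id",)` tells them apart.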
-
property id
The document id as a hex string, human-readable and intended for non-binary environments such as HTTP, CLI, and HTML. It will be used as the major view.
- Return type
-
property parent_id
The document's parent id as a hex string, human-readable and intended for non-binary environments such as HTTP, CLI, and HTML. It will be used as the major view.
- Return type
-
property blob
Return blob, one of the content forms of a Document.
Note
Use content to return the content of a Document.
- Return type
ndarray
-
property embedding
Return the embedding of the content of a Document.
- Return type
ndarray
-
set_attrs(**kwargs)
Bulk update Document fields with the key-value pairs specified in kwargs.
See also
get_attrs() for bulk getting attributes
-
get_attrs(*args)
Bulk fetch Document fields and return a dict of the key-value pairs.
See also
update() for bulk set/update attributes
- Return type
Dict[str, Any]
-
property as_pb_object
- Return type
DocumentProto
-
property buffer
Return buffer, one of the content forms of a Document.
Note
Use content to return the content of a Document.
- Return type
bytes
-
property text
Return text, one of the content forms of a Document.
Note
Use content to return the content of a Document.
-
property uri
- Return type
str
-
property mime_type
Get the MIME type of the document.
- Return type
str
-
property content_type
Return the content type of the document. Possible values: text, blob, buffer.
- Return type
str
-
property content
Return the content of the document. It checks which of the fields blob, text, and buffer has a value and returns it.
- Return type
~DocumentContentType
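The "whichever field has a value" lookup can be sketched as a first-populated-field search. This is a hypothetical illustration: `resolve_content` works on a plain dict, uses a list as a stand-in for an ndarray blob, and the check order is an assumption.

```python
from typing import Any, Dict, Optional, Tuple

# ASSUMPTION: fields are checked in this order; only one is expected to be set.
CONTENT_FIELDS = ("blob", "text", "buffer")


def resolve_content(fields: Dict[str, Any]) -> Tuple[Optional[str], Any]:
    """Return (content_type, value) for the first populated content field."""
    for name in CONTENT_FIELDS:
        value = fields.get(name)
        if value is not None and len(value) > 0:
            return name, value
    # No content field carries a value.
    return None, None


text_doc = resolve_content({"text": "hi"})
buffer_doc = resolve_content({"buffer": b"\x00\x01"})
empty_doc = resolve_content({"text": "", "buffer": b""})
```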
-
property granularity
-
property score
-
convert_buffer_to_blob(**kwargs)
Assuming the buffer is a valid buffer of a NumPy ndarray, set blob accordingly.
- Parameters
kwargs – reserved for maximum compatibility when using with ConvertDriver
Note
Only the values, not the shape information, can be recovered from a pure buffer.
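The note about lost shape information can be demonstrated with the standard library's `array` module, used here as a stand-in for a NumPy ndarray. The variable names are illustrative only.

```python
from array import array

# A 2 x 3 matrix of doubles, flattened row by row before serialization.
matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flat = array("d", [value for row in matrix for value in row])
raw = flat.tobytes()  # raw bytes only; no shape is stored

# Reading the bytes back yields the values as a flat sequence.
recovered = array("d")
recovered.frombytes(raw)

# The shape has to be supplied from elsewhere to rebuild the rows.
rows, cols = 2, 3
reshaped = [list(recovered[r * cols:(r + 1) * cols]) for r in range(rows)]
```

The same six values could equally be a 3 x 2 or 1 x 6 array, which is why the shape must be tracked separately from the buffer.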
-
convert_uri_to_buffer(**kwargs)
Convert uri to buffer. Internally it downloads from the URI and sets buffer.
- Parameters
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_uri_to_data_uri(charset='utf-8', base64=False, **kwargs)
Convert uri to data URI. Internally it reads uri into buffer and converts it to a data URI.
- Parameters
charset (str) – charset may be any character set registered with IANA
base64 (bool) – whether to use Base64, which encodes arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_buffer_to_uri(charset='utf-8', base64=False, **kwargs)
Convert buffer to data URI. Internally it first reads into buffer and then converts it to a data URI.
- Parameters
charset (str) – charset may be any character set registered with IANA
base64 (bool) – whether to use Base64, which encodes arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.
kwargs – reserved for maximum compatibility when using with ConvertDriver
-
convert_text_to_uri(charset='utf-8', base64=False, **kwargs)
Convert text to data URI.
- Parameters
charset (str) – charset may be any character set registered with IANA
base64 (bool) – whether to use Base64, which encodes arbitrary octet sequences into a form that satisfies the rules of 7bit. Designed to be efficient for non-text 8 bit and binary data. Sometimes used for text data that frequently uses non-US-ASCII characters.
kwargs – reserved for maximum compatibility when using with ConvertDriver
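The roles of charset and base64 in a data URI can be sketched with the standard library. This is an illustration, not the method's implementation: `text_to_data_uri`, its `use_base64` flag, and its `mime_type` parameter are hypothetical names.

```python
import base64
from urllib.parse import quote


def text_to_data_uri(
    text: str,
    charset: str = "utf-8",
    use_base64: bool = False,
    mime_type: str = "text/plain",  # hypothetical parameter for this sketch
) -> str:
    """Build a data URI from text (hypothetical helper)."""
    data = text.encode(charset)
    if use_base64:
        # Base64 keeps arbitrary bytes within a 7-bit-safe alphabet.
        payload = base64.b64encode(data).decode("ascii")
        return f"data:{mime_type};charset={charset};base64,{payload}"
    # Without Base64, bytes outside the URL-safe set are percent-encoded.
    return f"data:{mime_type};charset={charset},{quote(data)}"


encoded = text_to_data_uri("hello", use_base64=True)
percent = text_to_data_uri("hello world")
```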
-
convert_uri_to_text(**kwargs)
Assuming the URI is text, convert it to text.
- Parameters
kwargs – reserved for maximum compatibility when using with ConvertDriver
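Reading text back from a URI can be sketched with `urllib.request.urlopen`, which handles `data:` URLs as well as `file:` and `http(s):` ones. The helper `uri_to_text` and its `charset` parameter are assumptions for this sketch; how the method picks a charset is not stated here.

```python
from urllib.request import urlopen


def uri_to_text(uri: str, charset: str = "utf-8") -> str:
    """Fetch the bytes behind a URI and decode them as text (hypothetical)."""
    with urlopen(uri) as response:
        return response.read().decode(charset)


# A data URI needs no network access, so it makes a convenient example.
text = uri_to_text("data:text/plain;charset=utf-8;base64,aGVsbG8=")
```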