jina.types.document.mixins.text module

class jina.types.document.mixins.text.TextDataMixin[source]

Bases: object

Provide helper functions for Document to support text data.

load_uri_to_text(charset='utf-8')[source]

Convert uri to the .text attribute in place.

Parameters

charset (str) – charset may be any character set registered with IANA

Return type

~T

Returns

the Document itself after processing
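
As a rough illustration of what charset controls, the sketch below fetches the bytes behind a URI with the standard library and decodes them. It does not use Jina: read_uri_as_text is a hypothetical helper, and the library's own URI handling may differ.

```python
from urllib.request import urlopen


def read_uri_as_text(uri: str, charset: str = "utf-8") -> str:
    """Hypothetical helper (not part of Jina): fetch a URI and decode its bytes."""
    with urlopen(uri) as resp:
        return resp.read().decode(charset)


# A data URI keeps the example self-contained (no network or file access).
text = read_uri_as_text("data:text/plain;charset=utf-8,hello%20world")
```

Passing a charset that does not match the underlying bytes would raise a UnicodeDecodeError or produce garbled text in this sketch.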

get_vocabulary(text_attrs=('text',))[source]

Get the text vocabulary as a counter dict that maps each word to its frequency across all attributes listed in text_attrs.

Parameters

text_attrs (Tuple[str, …]) – the textual attributes where vocabulary will be derived from

Return type

Dict[str, int]

Returns

a vocabulary dictionary where each key is a word and each value is the frequency of that word across the given text attributes.
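
The word-to-frequency mapping described above can be sketched with collections.Counter. This is only an approximation that does not call Jina: count_words is a hypothetical helper, and the tokenization used here (lowercasing, then splitting on non-word characters) is an assumption that may not match the library's.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def count_words(texts: Iterable[str]) -> Dict[str, int]:
    """Hypothetical helper: count word occurrences across several strings."""
    counter: Counter = Counter()
    for t in texts:
        # Assumed tokenization: lowercase, then keep runs of word characters.
        counter.update(re.findall(r"\w+", t.lower()))
    return dict(counter)


vocab_counts = count_words(["Hello world", "hello Jina"])
```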

convert_text_to_blob(vocab, max_length=None, dtype='int64')[source]

Convert text to blob in place.

The resulting blob is a one-dimensional array of length D, where D is max_length.

To get the vocab of a DocumentArray, you can use jina.types.document.converters.build_vocab.

Parameters
  • vocab (Dict[str, int]) – a dictionary that maps a word to an integer index. Index 0 is reserved for padding and index 1 for unknown words in the text, so do not include these two entries in vocab.

  • max_length (Optional[int]) – the maximum length of the sequence. Sequences longer than this are cut off from the beginning. Sequences shorter than this are padded with 0 from the right-hand side.

  • dtype (str) – the dtype of the generated blob

Return type

~T

Returns

Document itself after processed
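
The reserved indices and the max_length handling can be sketched in plain Python. This is not Jina's implementation: build_index and encode are hypothetical helpers, whitespace tokenization is an assumption, and the sketch reads the parameter description as "drop leading tokens when too long, append zeros when too short", which may differ from the library's actual behavior.

```python
from typing import Dict, List, Optional


def build_index(words: List[str]) -> Dict[str, int]:
    """Hypothetical helper: assign indices starting at 2 (0 = padding, 1 = unknown)."""
    return {w: i for i, w in enumerate(sorted(set(words)), start=2)}


def encode(text: str, vocab: Dict[str, int], max_length: Optional[int] = None) -> List[int]:
    """Hypothetical helper: map words to indices, then truncate or pad."""
    # Words missing from vocab fall back to the reserved "unknown" index 1.
    ids = [vocab.get(w, 1) for w in text.lower().split()]
    if max_length is not None:
        if len(ids) > max_length:
            # Assumed reading of "cut off from the beginning": keep the last tokens.
            ids = ids[-max_length:]
        else:
            # Assumed reading of "padded from the right-hand side": append zeros.
            ids = ids + [0] * (max_length - len(ids))
    return ids


vocab = build_index(["hello", "world"])
padded = encode("hello brave world", vocab, max_length=5)
truncated = encode("hello brave world", vocab, max_length=2)
```

Starting the indices at 2 is what keeps a user-supplied vocab from colliding with the padding and unknown-word entries.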

convert_blob_to_text(vocab, delimiter=' ')[source]

Convert blob to text in place.

Parameters
  • vocab (Union[Dict[str, int], Dict[int, str]]) – a dictionary that maps a word to an integer index. Index 0 is reserved for padding and index 1 for unknown words in the text.

  • delimiter (str) – the delimiter used to join all words into text

Return type

~T

Returns

Document itself after processed
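
The reverse direction can be sketched the same way. The helper below is hypothetical and does not call Jina. It accepts either a word-to-index or an index-to-word dictionary, as the Union type of vocab suggests. Skipping padding and rendering unknown words as "<UNK>" are assumptions about reasonable behavior, not confirmed library output.

```python
from typing import Dict, Iterable, Union


def decode(
    ids: Iterable[int],
    vocab: Union[Dict[str, int], Dict[int, str]],
    delimiter: str = " ",
    unknown: str = "<UNK>",  # assumed placeholder for the reserved index 1
) -> str:
    """Hypothetical helper: turn a sequence of indices back into text."""
    # Invert the mapping if it goes from word to index.
    if vocab and isinstance(next(iter(vocab)), str):
        index_to_word = {i: w for w, i in vocab.items()}
    else:
        index_to_word = dict(vocab)

    words = []
    for i in ids:
        i = int(i)
        if i == 0:
            continue  # reserved padding index carries no word
        if i == 1:
            words.append(unknown)  # reserved unknown-word index
        else:
            words.append(index_to_word.get(i, unknown))
    return delimiter.join(words)


decoded = decode([2, 1, 3, 0, 0], {"hello": 2, "world": 3})
```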

dump_text_to_datauri(charset='utf-8', base64=False)[source]

Convert text to data uri.

Parameters
  • charset (str) – charset may be any character set registered with IANA

  • base64 (bool) – whether to base64-encode the data. Base64 encodes arbitrary octet sequences into a form that satisfies the rules of 7bit. It is designed to be efficient for non-text 8-bit and binary data, and is sometimes used for text data that frequently uses non-US-ASCII characters.

Return type

~T

Returns

the Document itself after processing
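
To show how charset and base64 shape the result, here is a minimal standard-library sketch of building a data URI from text. text_to_datauri and its mimetype parameter are hypothetical, and the exact URI layout Jina produces may differ.

```python
import base64 as b64
from urllib.parse import quote


def text_to_datauri(
    text: str,
    charset: str = "utf-8",
    use_base64: bool = False,
    mimetype: str = "text/plain",  # assumed media type for this sketch
) -> str:
    """Hypothetical helper: encode text as a data URI."""
    raw = text.encode(charset)
    if use_base64:
        # Base64 payload: safe for arbitrary bytes, at roughly 4/3 the size.
        payload = b64.b64encode(raw).decode("ascii")
        return f"data:{mimetype};charset={charset};base64,{payload}"
    # Percent-encoded payload: compact for mostly-ASCII text.
    return f"data:{mimetype};charset={charset},{quote(raw)}"


plain_uri = text_to_datauri("hello world")
b64_uri = text_to_datauri("hello world", use_base64=True)
```

Percent-encoding tends to be shorter for mostly-ASCII text, while base64 is usually the better choice when many bytes fall outside US-ASCII.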

convert_uri_to_text(**kwargs)
convert_text_to_uri(**kwargs)