jina.types.arrays.mixins.text module

class jina.types.arrays.mixins.text.TextToolsMixin

Bases: object

Helper functions for NLP tasks, used by DocumentArray (DA) and DocumentArrayMemmap (DAM).

get_vocabulary(min_freq=1, text_attrs=('text',))

Build the text vocabulary from all Documents as a dict that maps each word to its index.

Parameters

  • text_attrs (Tuple[str, …]) – the textual attributes from which the vocabulary is derived

  • min_freq (int) – the minimum frequency a word must have to be included in the vocabulary

Return type

Dict[str, int]


Returns

a vocabulary dictionary in which each key is a word and each value is its index. Indices start at 2: 0 is reserved for padding and 1 is reserved for the unknown token.
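
To illustrate the indexing scheme described above, here is a minimal standalone sketch in plain Python. It is not the library's implementation: the function name build_vocabulary, the constants PAD_INDEX and UNK_INDEX, and the lowercase-and-split-on-whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical names for the two reserved indices described above.
PAD_INDEX = 0  # reserved for padding
UNK_INDEX = 1  # reserved for the unknown token


def build_vocabulary(texts: Iterable[str], min_freq: int = 1) -> Dict[str, int]:
    """Map each sufficiently frequent word to an index starting at 2."""
    counts = Counter()
    for text in texts:
        # Assumption: lowercase and split on whitespace; the real tokenizer may differ.
        counts.update(text.lower().split())
    # Keep words that occur at least min_freq times, in first-seen order.
    kept = (word for word, freq in counts.items() if freq >= min_freq)
    return {word: index for index, word in enumerate(kept, start=2)}


vocab = build_vocabulary(["hello world", "hello jina"])
# Words missing from the vocabulary fall back to the unknown-token index.
ids = [vocab.get(word, UNK_INDEX) for word in "hello there".split()]
```

Because 0 and 1 never appear as values in the returned dict, a lookup such as vocab.get(word, UNK_INDEX) can use 1 for out-of-vocabulary words, and 0 remains free for padding sequences to a common length.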