Text

Text is everywhere, and it is easily accessible. Neural search over text is probably the first application that comes to mind. From fuzzy string matching to question answering, you can leverage Jina to build all kinds of text-based neural search solutions in just minutes. By leveraging state-of-the-art natural language processing and pretrained models, you can easily use Jina to take the text intelligence of your app to the next level.

In this chapter, we provide some tutorials to help you get started with different text-related tasks. But before that, let’s recap our knowledge of Document and see how Jina handles text data in general.

Textual document

Representing text in Jina is easy. Simply do:

from jina import Document

d = Document(text='hello, world.')
{'id': '1b00cab2-3738-11ec-a7d6-1e008a366d48', 'mime_type': 'text/plain', 'text': 'hello, world.'}

If your text data is large and cannot be written inline, or it comes from a URI, you can also define the uri first and load the text into the Document later.

from jina import Document

d = Document(uri='https://www.w3.org/History/19921103-hypertext/hypertext/README.html')
d.load_uri_to_text()
{'id': 'c558c262-3738-11ec-861b-1e008a366d48', 'uri': 'https://www.w3.org/History/19921103-hypertext/hypertext/README.html', 'mime_type': 'text/plain', 'text': '<TITLE>Read Me</TITLE>\n<NEXTID 7>\n<H1>WorldWideWeb distributed code</H1>See the CERN <A NAME=2 HREF=Copyright.html>copyright</A> .  This is the README file which you get when\nyou unwrap one of our tar files. These files contain information about\nhypertext, hypertext systems, and the WorldWideWeb project. If you\nhave taken this with a .tar file, you will have only a subset of the\nfiles.<P>\nTHIS FILE IS A VERY ABRIDGED VERSION OF THE INFORMATION AVAILABLE\nON THE WEB.   IF IN DOUBT, READ THE WEB DIRECTLY. If you have not\ngot any browser installed, do this by telnet to info.cern.ch (no username\nor password).\n<H2>Archive Directory structure</...'}

And of course, you can use characters from different languages.

from jina import Document

d = Document(text='👋	नमस्ते दुनिया!	你好世界!こんにちは世界!	Привет мир!')
{'id': '225f7134-373b-11ec-8373-1e008a366d48', 'mime_type': 'text/plain', 'text': '👋\tनमस्ते दुनिया!\t你好世界!こんにちは世界!\tПривет мир!'}

Segment long documents

Oftentimes when you index or search textual documents, you don’t want to treat thousands of words as one document; some finer granularity would be nice. You can achieve this by leveraging the chunks of a Document. For example, let’s segment this simple document at the ! mark:

from jina import Document

d = Document(text='👋	नमस्ते दुनिया!	你好世界!こんにちは世界!	Привет мир!')

d.chunks.extend([Document(text=c) for c in d.text.split('!')])
{'id': '6a863d84-373c-11ec-97cc-1e008a366d48', 'chunks': [{'id': '6a864158-373c-11ec-97cc-1e008a366d48', 'mime_type': 'text/plain', 'text': '👋\tनमस्ते दुनिया', 'granularity': 1, 'parent_id': '6a863d84-373c-11ec-97cc-1e008a366d48'}, {'id': '6a864202-373c-11ec-97cc-1e008a366d48', 'mime_type': 'text/plain', 'text': '\t你好世界', 'granularity': 1, 'parent_id': '6a863d84-373c-11ec-97cc-1e008a366d48'}, {'id': '6a8642a2-373c-11ec-97cc-1e008a366d48', 'mime_type': 'text/plain', 'text': 'こんにちは世界', 'granularity': 1, 'parent_id': '6a863d84-373c-11ec-97cc-1e008a366d48'}, {'id': '6a864324-373c-11ec-97cc-1e008a366d48', 'mime_type': 'text/plain', 'text': '\tПривет мир', 'granularity': 1, 'parent_id': '6a863d84-373c-11ec-97cc-1e008a366d48'}, {'id': '6a8643a6-373c-11ec-97cc-1e008a366d48', 'mime_type': 'text/plain', 'text': '', 'granularity': 1, 'parent_id': '6a863d84-373c-11ec-97cc-1e008a366d48'}], 'mime_type': 'text/plain', 'text': '👋\tनमस्ते दुनिया!\t你好世界!こんにちは世界!\tПривет мир!'}

This creates five sub-documents under the original document and stores them in .chunks. To see this more clearly, you can visualize them via d.plot().

[Figure: the Document visualized with d.plot(), showing the root Document and its five chunks]
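Note that the last chunk is empty. This is not a Jina quirk but standard Python str.split behavior: when the string ends with the separator, a trailing empty string is produced. A quick check in plain Python:

```python
# str.split yields a trailing empty string when the text ends
# with the separator -- hence the fifth, empty chunk above.
text = '👋\tनमस्ते दुनिया!\t你好世界!こんにちは世界!\tПривет мир!'
parts = text.split('!')
print(len(parts))   # 5
print(parts[-1])    # '' (empty string)
```

If you don’t want the empty chunk, you can filter it out before extending, e.g. `[Document(text=c) for c in d.text.split('!') if c]`.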

Convert text into ndarray

Sometimes you may need to encode the text into a numpy.ndarray before further computation. We provide some helper functions in Document and DocumentArray that let you do the conversion easily.

For example, given a DocumentArray with three Documents:

from jina import DocumentArray, Document

da = DocumentArray([Document(text='hello world'), Document(text='goodbye world'), Document(text='hello goodbye')])

To get the vocabulary, you can use:

vocab = da.get_vocabulary()
{'hello': 2, 'world': 3, 'goodbye': 4}

The vocabulary is 2-indexed, as 0 is reserved for the padding symbol and 1 is reserved for the unknown symbol.
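Conceptually, such a vocabulary can be built with a few lines of plain Python. The sketch below reproduces the output above; whitespace tokenization and first-appearance ordering are assumptions for illustration, not a claim about Jina’s exact internals:

```python
# Rough sketch of vocabulary building: ids start at 2 because
# 0 is reserved for padding and 1 for unknown tokens.
texts = ['hello world', 'goodbye world', 'hello goodbye']

vocab = {}
for text in texts:
    for token in text.split():  # assumption: whitespace tokenization
        if token not in vocab:
            vocab[token] = len(vocab) + 2  # first id is 2

print(vocab)  # {'hello': 2, 'world': 3, 'goodbye': 4}
```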

You can further use this vocabulary to convert the .text field into .blob via:

for d in da:
    d.convert_text_to_blob(vocab)
    print(d.blob)
[2 3]
[4 3]
[2 4]

When you have texts of different lengths and want the output .blob to have the same length, you can define max_length during conversion:

da = DocumentArray([Document(text='a short phrase'), Document(text='word'), Document(text='this is a much longer sentence')])
vocab = da.get_vocabulary()

for d in da:
    d.convert_text_to_blob(vocab, max_length=10)
    print(d.blob)
[0 0 0 0 0 0 0 2 3 4]
[0 0 0 0 0 0 0 0 0 5]
[ 0  0  0  0  6  7  2  8  9 10]
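As the output shows, shorter texts are left-padded with 0 up to max_length. A minimal sketch of this behavior, assuming whitespace tokenization and left-side truncation (the helper name and truncation side here are assumptions for illustration):

```python
# Sketch of max_length padding: map each token to its id
# (1 if unknown), then left-pad with 0 up to max_length.
def text_to_ids(text, vocab, max_length=10):
    ids = [vocab.get(tok, 1) for tok in text.split()]
    ids = ids[-max_length:]  # assumption: truncate from the left if too long
    return [0] * (max_length - len(ids)) + ids  # left-pad with 0

vocab = {'a': 2, 'short': 3, 'phrase': 4, 'word': 5,
         'this': 6, 'is': 7, 'much': 8, 'longer': 9, 'sentence': 10}
print(text_to_ids('a short phrase', vocab))
# [0, 0, 0, 0, 0, 0, 0, 2, 3, 4]
```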

You can also use .blobs of DocumentArray to get all blobs in one ndarray.

print(da.blobs)
[[ 0  0  0  0  0  0  0  2  3  4]
 [ 0  0  0  0  0  0  0  0  0  5]
 [ 0  0  0  0  6  7  2  8  9 10]]
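Conceptually, .blobs stacks the per-Document blobs row-wise into a single 2D array, which is why they must all share the same length (hence max_length above). A sketch with plain numpy:

```python
import numpy as np

# Stacking equal-length per-Document blobs row-wise yields
# one (n_docs, max_length) ndarray, like DocumentArray.blobs.
blobs = [
    np.array([0, 0, 0, 0, 0, 0, 0, 2, 3, 4]),
    np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 5]),
    np.array([0, 0, 0, 0, 6, 7, 2, 8, 9, 10]),
]
stacked = np.stack(blobs)
print(stacked.shape)  # (3, 10)
```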

Convert ndarray back to text

As a bonus, you can also easily convert an integer ndarray back to text based on a given vocabulary. This procedure is often termed “decoding”.

da = DocumentArray([Document(text='a short phrase'), Document(text='word'), Document(text='this is a much longer sentence')])
vocab = da.get_vocabulary()

# encoding
for d in da:
    d.convert_text_to_blob(vocab, max_length=10)

# decoding
for d in da:
    d.convert_blob_to_text(vocab)
    print(d.text)
a short phrase
word
this is a much longer sentence

That’s all you need to know about textual data. Good luck building text search solutions with Jina!