A Guide on MIME Types in Jina

This guide explains what a MIME type is, how to assign one to a Document manually or automatically, and how to work with MIME types at the Chunk and Driver level.

Summary

The MIME type of a Document is a string property (document.mime_type) that stores the media type of the Document's content. Drivers can use this information to handle Documents differently based on their MIME type. The MIME type can be set to one of the values in mimetypes.types_map.values() (e.g. video/mpeg, text/html, image/jpeg, application/msword…). It is derived automatically when one of the content attributes uri, text or buffer is set, and it can be overwritten manually afterwards.
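
To make the set of accepted values concrete, the following sketch inspects mimetypes.types_map from the Python standard library. It does not use Jina at all; the helper name is_known_mime_type is hypothetical and only illustrates a membership check against that table.

```python
import mimetypes

# All MIME types the Python standard library knows about (values of the
# extension -> type table, e.g. '.html' -> 'text/html').
KNOWN_MIME_TYPES = set(mimetypes.types_map.values())


def is_known_mime_type(mime_type: str) -> bool:
    """Hypothetical helper: True if the string is listed in mimetypes.types_map."""
    return mime_type in KNOWN_MIME_TYPES


for candidate in ('text/html', 'image/jpeg', 'invalid/type'):
    print(candidate, is_known_mime_type(candidate))
```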

Automatic MIME Type Assignment

The following example shows which MIME types are automatically derived from the content attributes:

auto_assignment.py
from jina import Document

d1 = Document()
d1.text = 'my text 📩'
assert d1.mime_type == 'text/plain'

d2 = Document()
d2.uri = 'https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg'
assert d2.mime_type == 'image/jpeg'

d3 = Document()
d3.buffer = (b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR...')
assert d3.mime_type == 'image/png'
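
As a rough illustration of how such guesses can be made, the sketch below uses only the Python standard library. It is not Jina's implementation: guess_from_uri and guess_from_buffer are hypothetical helpers, and the buffer check recognises nothing but the PNG signature, whereas a real content sniffer covers many formats.

```python
import mimetypes
from typing import Optional

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def guess_from_uri(uri: str) -> Optional[str]:
    """Guess a MIME type from the file extension in a URI (hypothetical helper)."""
    mime_type, _encoding = mimetypes.guess_type(uri)
    return mime_type


def guess_from_buffer(buffer: bytes) -> Optional[str]:
    """Minimal content sniffer (hypothetical helper): only PNG is recognised."""
    if buffer.startswith(PNG_SIGNATURE):
        return 'image/png'
    return None


print(guess_from_uri('https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg'))
print(guess_from_buffer(b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR'))
```

Extension-based guessing never inspects the content, so it can be wrong for mislabelled files; signature-based sniffing avoids that at the cost of needing one rule per format.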

Manual MIME Type Assignment

To have full control over the MIME type, it is also possible to set it manually as shown in the following example:

manual_assignment.py
from jina import Document

svg_content = (
    '<svg height="100" width="100">'
    '  <circle cx="50" cy="50" r="40" stroke="black" stroke-width="3" fill="red" />'
    '</svg>'
).encode('utf8')
d = Document()
d.buffer = (svg_content)
assert d.mime_type == 'image/svg'
# the MIME type can be overwritten
d.mime_type = 'text/plain'
# assigning an invalid MIME type raises a ValueError
d.mime_type = 'invalid/type'  # raises ValueError
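
The validation behaviour described above can be sketched with a plain Python property. TypedRecord is a hypothetical stand-in class, not part of Jina, and checking against mimetypes.types_map is an assumption based on the summary; the actual check inside Document may differ.

```python
import mimetypes


class TypedRecord:
    """Hypothetical stand-in for a Document; not part of Jina."""

    def __init__(self) -> None:
        self._mime_type = ''

    @property
    def mime_type(self) -> str:
        return self._mime_type

    @mime_type.setter
    def mime_type(self, value: str) -> None:
        # Reject anything the standard library does not list as a MIME type.
        if value not in mimetypes.types_map.values():
            raise ValueError(f'{value!r} is not a recognised MIME type')
        self._mime_type = value


record = TypedRecord()
record.mime_type = 'text/plain'  # accepted

try:
    record.mime_type = 'invalid/type'
except ValueError as error:
    print(f'rejected: {error}')
```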

MIME Type in Chunks

Chunks can be created by Segmenters, or created by the user and attached to Documents before these are fed into the Flow. In many use cases, Chunks have the same MIME type as their parent Document. For instance, segmenting an image or an audio file produces Chunks of the same MIME type.

In other use cases, the parent and its Chunks need different MIME types, for example when processing video, where the Chunks are images. A Segmenter is responsible for assigning the correct MIME type to the Chunks when they are created.

If a Chunk has no MIME type set, the SegmentDriver assigns the MIME type of the parent Document as the default value. The following example shows a simple Segmenter that sets the mime_type of each Chunk it creates.

dummy_segmenter.py
from jina.executors.segmenters import BaseSegmenter

class DummySegmenter(BaseSegmenter):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def segment(self, text: str, *args, **kwargs):
        results = [{
            'text': word,
            'mime_type': 'text/plain'
        } for word in text.split()]
        return results
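
The fallback to the parent's MIME type can be pictured with plain dictionaries. This is only a sketch of the rule described above, not the SegmentDriver's code; fill_default_mime_type is a hypothetical function, and the chunk dictionaries mirror the shape returned by the Segmenter example.

```python
from typing import Dict, List


def fill_default_mime_type(
    parent_mime_type: str, chunks: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Return copies of the chunks, using the parent's MIME type where none is set."""
    return [
        {**chunk, 'mime_type': chunk.get('mime_type') or parent_mime_type}
        for chunk in chunks
    ]


chunks = [
    {'text': 'frame 1', 'mime_type': 'image/jpeg'},  # set explicitly by a segmenter
    {'text': 'frame 2'},                             # left unset
]
filled = fill_default_mime_type('video/mpeg', chunks)
print(filled)
```

Returning copies keeps the input untouched, which makes it easy to see which values were inherited and which were set explicitly.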

Usage in Driver

Drivers can read the MIME type of each Document and handle it accordingly. The following Driver only encodes Documents whose mime_type is 'text/plain':

special_segment_driver.py
class EncodeTextDriver(...):
    def _apply_all(...) -> None:
        for doc in docs:
            # skip every Document that is not plain text
            if doc.mime_type == 'text/plain':
                embeds = self.exec_fn(contents)
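
Because the Driver above is schematic, here is a self-contained sketch of the same filtering idea that runs without Jina. SimpleDoc, encode_text_docs and length_encoder are hypothetical names introduced for illustration; the encoder is a toy that maps each text to its length.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a Document; not part of Jina."""
    text: str
    mime_type: str
    embedding: Optional[List[float]] = None


def encode_text_docs(
    docs: List[SimpleDoc],
    encode: Callable[[List[str]], List[List[float]]],
) -> None:
    """Encode only the docs whose MIME type is 'text/plain', in one batch."""
    text_docs = [doc for doc in docs if doc.mime_type == 'text/plain']
    if not text_docs:
        return
    embeddings = encode([doc.text for doc in text_docs])
    for doc, embedding in zip(text_docs, embeddings):
        doc.embedding = embedding


def length_encoder(texts: List[str]) -> List[List[float]]:
    """Toy encoder: a one-dimensional vector holding the text length."""
    return [[float(len(text))] for text in texts]


docs = [SimpleDoc('hello', 'text/plain'), SimpleDoc('', 'image/png')]
encode_text_docs(docs, length_encoder)
print(docs)
```

Filtering first and encoding the remaining texts as one batch keeps non-text Documents out of the encoder entirely.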