DocumentArrayMemmap

When a DocumentArray object contains a large number of Documents, holding it in memory can be very demanding. DocumentArrayMemmap is a drop-in replacement for DocumentArray in this scenario.

Important

DocumentArrayMemmap shares almost the same API as DocumentArray, except for insert, in-place reverse, and in-place sort.

How does it work?

A DocumentArrayMemmap stores all Documents directly on disk, while keeping in memory a small lookup table and a fixed-size buffer pool of Documents. The lookup table contains only the offset and length of each Document, so it is much smaller than the full DocumentArray. Documents are loaded into memory on demand when they are accessed. Loaded Documents are kept in the buffer pool so that they can be modified.
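The offset-and-length lookup idea can be illustrated with a minimal sketch that uses only the standard library. It is not Jina's implementation: the class `TinyDiskStore` and the file name `body.bin` are hypothetical, and the payloads are plain bytes rather than serialized Documents.

```python
import os
import tempfile


class TinyDiskStore:
    """Append-only store: payloads live in one file, the index lives in memory."""

    def __init__(self, path):
        self.path = path
        self.index = []  # one (offset, length) pair per stored item
        open(path, 'ab').close()  # make sure the file exists

    def append(self, payload: bytes) -> None:
        with open(self.path, 'ab') as f:
            offset = f.seek(0, os.SEEK_END)  # current end of file
            f.write(payload)
        self.index.append((offset, len(payload)))

    def __getitem__(self, i: int) -> bytes:
        # Read exactly one item from disk, located via the in-memory index.
        offset, length = self.index[i]
        with open(self.path, 'rb') as f:
            f.seek(offset)
            return f.read(length)

    def __len__(self) -> int:
        return len(self.index)


with tempfile.TemporaryDirectory() as tmp:
    store = TinyDiskStore(os.path.join(tmp, 'body.bin'))
    for text in ('hello', 'world', 'goodbye'):
        store.append(text.encode())
    second = store[1]                   # only this item is read from disk
    index_snapshot = list(store.index)  # [(0, 5), (5, 5), (10, 7)]
```

The in-memory cost per item is two integers, regardless of how large the stored payload is, which is why the lookup table stays small.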

Construct

from jina import DocumentArrayMemmap

dam = DocumentArrayMemmap()  # use a local temporary folder as storage
dam2 = DocumentArrayMemmap('./my-memmap')  # use './my-memmap' as storage

Delete

To delete all contents of a DocumentArrayMemmap object, call .clear(). This also removes all of its content from disk.

You can also check the disk usage of a DocumentArrayMemmap via its .physical_size property.

Convert to/from DocumentArray

from jina import Document, DocumentArray, DocumentArrayMemmap

da = DocumentArray([Document(text='hello'), Document(text='world')])

# convert DocumentArray to DocumentArrayMemmap
dam = DocumentArrayMemmap()
dam.extend(da)

# convert DocumentArrayMemmap to DocumentArray
da = DocumentArray(dam)

Advanced

Warning

DocumentArrayMemmap is generally intended for one-way access, either read-only or write-only. Interleaving reads and writes on a DocumentArrayMemmap is not safe and is not recommended in production.

Understand buffer pool

Recently added, modified, or accessed Documents are kept in an in-memory buffer pool. This allows all changes to Documents to be applied in memory first and then persisted to disk lazily (i.e. when they leave the buffer pool or when the dam object’s destructor is called). If you want to persist changed Documents immediately, you can call .flush().

The pool size can be configured with the constructor argument buffer_pool_size (1,000 by default). Only the buffer_pool_size most recently accessed, modified, or added Documents are kept in the pool. When the pool is full, Documents are replaced following the least-recently-used (LRU) strategy.

from jina import DocumentArrayMemmap

dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)

Warning

The buffer pool is what ensures that Documents modified in memory are persisted to disk. Therefore, you should not modify a Document through a reference you hold yourself if that Document might no longer be in the buffer pool. The next section explains the best practices for modifying Documents.
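To make the pool's behavior concrete, here is a minimal sketch of an LRU pool with lazy write-back, built on the standard library only. It is an illustration under simplifying assumptions, not Jina's implementation: the class `LRUWriteBackPool` is hypothetical, a plain dict stands in for the on-disk storage, and items are dicts instead of Documents.

```python
from collections import OrderedDict


class LRUWriteBackPool:
    """Fixed-size pool; items are written to `disk` when evicted or flushed."""

    def __init__(self, disk: dict, pool_size: int):
        self.disk = disk            # stand-in for on-disk storage
        self.pool_size = pool_size
        self.pool = OrderedDict()   # key -> in-memory item, least recent first

    def _touch(self, key, item):
        self.pool[key] = item
        self.pool.move_to_end(key)  # mark as most recently used
        while len(self.pool) > self.pool_size:
            old_key, old_item = self.pool.popitem(last=False)
            self.disk[old_key] = dict(old_item)  # lazy write-back on eviction

    def add(self, key, item):
        self.disk[key] = dict(item)  # new items are persisted right away
        self._touch(key, item)

    def get(self, key):
        if key in self.pool:
            item = self.pool[key]
        else:
            item = dict(self.disk[key])  # load a fresh copy from "disk"
        self._touch(key, item)
        return item

    def flush(self):
        for key, item in self.pool.items():
            self.disk[key] = dict(item)


disk = {}
pool = LRUWriteBackPool(disk, pool_size=2)
first = {'text': 'hello'}
pool.add(0, first)
pool.add(1, {'text': 'hello'})
pool.add(2, {'text': 'hello'})      # pool is full: key 0 is evicted

pooled_keys = list(pool.pool)       # [1, 2]

pool.get(2)['text'] = 'goodbye'     # the change exists only in memory so far
before_flush = disk[2]['text']      # 'hello'
pool.flush()
after_flush = disk[2]['text']       # 'goodbye'

first['text'] = 'goodbye'           # stale reference: the pool no longer tracks it
stale_result = pool.get(0)['text']  # 'hello'
```

In this sketch the last two lines show the failure mode the warning describes: once an item has been evicted, the object you still hold is disconnected from the store, and a later access loads a fresh copy from storage.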

Modify elements

Modifying elements of a DocumentArrayMemmap is possible because accessed and modified Documents are kept in the buffer pool:

from jina import DocumentArrayMemmap, Document

d1 = Document(text='hello')
d2 = Document(text='world')

dam = DocumentArrayMemmap('./my-memmap')
dam.extend([d1, d2])

dam[0].text = 'goodbye'

print(dam[0].text)
goodbye

However, you should not modify Documents through references that you hold yourself and that might no longer be in the buffer pool. Here are some practices to avoid:

  1. Keep references to more Documents than the buffer pool can hold and modify them:

    from jina import Document, DocumentArrayMemmap 
    
    docs = [Document(text='hello') for _ in range(100)]
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
    dam.extend(docs)
    for doc in docs:
        doc.text = 'goodbye'
    
    dam[50].text
    
    hello
    

    Use the dam object to modify instead:

    from jina import Document, DocumentArrayMemmap
    
    docs = [Document(text='hello') for _ in range(100)]
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
    dam.extend(docs)
    for doc in dam:
        doc.text = 'goodbye'
    
    dam[50].text
    
    goodbye
    

    It’s also okay to modify Documents through your own references if there are fewer of them than the buffer pool size:

    from jina import Document, DocumentArrayMemmap
    
    docs = [Document(text='hello') for _ in range(100)]
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=1000)
    dam.extend(docs)
    for doc in docs:
        doc.text = 'goodbye'
    
    dam[50].text
    
    goodbye
    
  2. Modify a reference that might have left the buffer pool:

    from jina import Document, DocumentArrayMemmap
    
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
    my_doc = Document(text='hello')
    dam.append(my_doc)
    
    # my_doc leaves the buffer pool after extend
    dam.extend([Document(text='hello') for _ in range(99)])
    my_doc.text = 'goodbye'
    dam[0].text
    
    hello
    

    Get the Document from the dam object and then modify it:

    from jina import Document, DocumentArrayMemmap
    
    dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
    my_doc = Document(text='hello')
    dam.append(my_doc)
    
    # my_doc leaves the buffer pool after extend
    dam.extend([Document(text='hello') for _ in range(99)])
    dam[my_doc.id].text = 'goodbye' # or dam[0].text = 'goodbye'
    dam[0].text
    
    goodbye
    

To summarize, it’s a best practice to rely on the dam object to reference the Documents that you modify.

Maintain consistency

Consider two DocumentArrayMemmap objects that share the same on-disk storage ./my-memmap but live in different processes or threads. After some write operations, their lookup tables and buffer pools may become inconsistent, as each DocumentArrayMemmap object keeps its own version of the lookup table and buffer pool in memory. .reload() and .flush() solve this issue:

from jina import Document, DocumentArrayMemmap

d1 = Document(text='hello')
d2 = Document(text='world')

dam = DocumentArrayMemmap('./my-memmap')
dam2 = DocumentArrayMemmap('./my-memmap')

dam.extend([d1, d2])
assert len(dam) == 2
assert len(dam2) == 0

dam2.reload()
assert len(dam2) == 2

dam.clear()
assert len(dam) == 0
assert len(dam2) == 2

dam2.reload()
assert len(dam2) == 0

You don’t need to call .flush() after adding new Documents. However, if you modify an attribute of a Document, you need to call .flush() to persist the change before other objects can see it:

from jina import Document, DocumentArrayMemmap

d1 = Document(text='hello')

dam = DocumentArrayMemmap('./my-memmap')
dam2 = DocumentArrayMemmap('./my-memmap')

dam.append(d1)
d1.text = 'goodbye'
assert len(dam) == 1
assert len(dam2) == 0

dam2.reload()
assert len(dam2) == 1
assert dam2[0].text == 'hello'

dam.flush()
dam2.reload()
assert dam2[0].text == 'goodbye'
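Why both calls are needed can be illustrated with a minimal standard-library sketch of two handles sharing one file, each with its own in-memory snapshot. It is not Jina's implementation: the class `SharedStoreHandle` and the file name `shared.jsonl` are hypothetical, and for brevity each handle keeps whole items in memory rather than a lookup table plus buffer pool.

```python
import json
import os
import tempfile


class SharedStoreHandle:
    """One handle on a shared JSON-lines file, with a private in-memory snapshot."""

    def __init__(self, path):
        self.path = path
        self.items = []
        self.reload()

    def reload(self):
        # Discard the private snapshot and re-read the on-disk state.
        self.items = []
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.items = [json.loads(line) for line in f]

    def append(self, item):
        self.items.append(item)
        with open(self.path, 'a') as f:  # new items reach the disk right away
            f.write(json.dumps(item) + '\n')

    def flush(self):
        # Rewrite the file so in-memory modifications become visible to others.
        with open(self.path, 'w') as f:
            for item in self.items:
                f.write(json.dumps(item) + '\n')

    def __len__(self):
        return len(self.items)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'shared.jsonl')
    writer, reader = SharedStoreHandle(path), SharedStoreHandle(path)

    doc = {'text': 'hello'}
    writer.append(doc)
    len_before_reload = len(reader)  # reader still has its old snapshot
    reader.reload()
    len_after_reload = len(reader)

    doc['text'] = 'goodbye'          # modified in the writer's memory only
    reader.reload()
    text_before_flush = reader.items[0]['text']
    writer.flush()
    reader.reload()
    text_after_flush = reader.items[0]['text']
```

The writer's flush() pushes in-memory changes to the shared storage, and the reader's reload() replaces a stale snapshot with what is on disk. A modification becomes visible to the other handle only after both have happened, in that order.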