# DocumentArrayMemmap

When a DocumentArray object contains a large number of Documents, holding it in memory can be very demanding. DocumentArrayMemmap is a drop-in replacement for DocumentArray in this scenario.

Important

DocumentArrayMemmap shares almost the same API as DocumentArray, except for insert, in-place reverse, and in-place sort.

## How does it work?

A DocumentArrayMemmap stores all Documents directly on disk, while keeping a small lookup table in memory and a buffer pool of Documents with a fixed size. The lookup table contains the offset and length of each Document so it is much smaller than the full DocumentArray. Elements are loaded on-demand to memory during access. Memory-loaded Documents are kept in the buffer pool to allow modifying Documents.
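The layout described above can be illustrated with a toy store. This is a minimal sketch and not Jina's implementation: the `ToyDiskStore` class, the JSON serialization, and the single `body.bin` data file are all assumptions made for illustration. It appends each record to a file and keeps only an `(offset, length)` pair per record in memory, so reading one element touches only that element's bytes.

```python
import json
import os
import tempfile


class ToyDiskStore:
    """Hypothetical append-only store: payloads on disk, (offset, length) in memory."""

    def __init__(self, path):
        self._path = path
        self._index = []  # one (offset, length) pair per record
        open(self._path, 'ab').close()  # make sure the data file exists

    def append(self, record):
        payload = json.dumps(record).encode('utf-8')
        offset = os.path.getsize(self._path)  # new record starts at the current end
        with open(self._path, 'ab') as fp:
            fp.write(payload)
        self._index.append((offset, len(payload)))

    def __getitem__(self, i):
        offset, length = self._index[i]
        with open(self._path, 'rb') as fp:
            fp.seek(offset)
            return json.loads(fp.read(length).decode('utf-8'))

    def __len__(self):
        return len(self._index)


tmp_dir = tempfile.mkdtemp()
store = ToyDiskStore(os.path.join(tmp_dir, 'body.bin'))
store.append({'text': 'hello'})
store.append({'text': 'world'})
second = store[1]  # only this record is read back from disk
```

The in-memory index grows by two integers per record regardless of how large each record is, which is why the lookup table stays small compared to the data itself.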

## Construct

from jina import DocumentArrayMemmap

dam = DocumentArrayMemmap()  # use a local temporary folder as storage
dam2 = DocumentArrayMemmap('./my-memmap')  # use './my-memmap' as storage


## Delete

To delete all contents of a DocumentArrayMemmap object, call .clear(). This also removes all of its content from disk.

You can also check the disk usage of a DocumentArrayMemmap via its .physical_size property.

## Convert to/from DocumentArray

from jina import Document, DocumentArray, DocumentArrayMemmap

da = DocumentArray([Document(text='hello'), Document(text='world')])

# convert DocumentArray to DocumentArrayMemmap
dam = DocumentArrayMemmap()
dam.extend(da)

# convert DocumentArrayMemmap to DocumentArray
da = DocumentArray(dam)


Warning

DocumentArrayMemmap is in general used for one-way access, either read-only or write-only. Interleaving reading and writing on a DocumentArrayMemmap is not safe and not recommended in production.

### Understand buffer pool

Recently added, modified or accessed Documents are kept in an in-memory buffer pool. This allows all changes to Documents to be applied first in memory and then be persisted to disk in a lazy way (i.e. when they leave the buffer pool or when the dam object’s destructor is called). To persist changed Documents immediately, call .flush().

The pool size can be configured with the constructor argument buffer_pool_size (1,000 by default). Only the buffer_pool_size most recently accessed, modified or added Documents stay in the pool. Documents are replaced following the LRU (least recently used) strategy.

from jina import DocumentArrayMemmap

dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
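To make the lazy persistence and LRU replacement more concrete, here is a minimal sketch of a write-back pool. It is not Jina's implementation: `ToyBufferPool` is a hypothetical class, a plain dict stands in for on-disk storage, and records are plain dicts rather than Documents. It shows that a change lives only in memory until the object is evicted or flushed, and that a reference held after eviction no longer reaches storage.

```python
from collections import OrderedDict


class ToyBufferPool:
    """Hypothetical LRU write-back pool; `disk` is a dict standing in for storage."""

    def __init__(self, disk, size):
        self._disk = disk
        self._size = size
        self._pool = OrderedDict()  # key -> in-memory object, least recent first

    def get(self, key):
        if key in self._pool:
            self._pool.move_to_end(key)  # mark as most recently used
        else:
            self._add(key, dict(self._disk[key]))  # load a copy from "disk"
        return self._pool[key]

    def _add(self, key, obj):
        self._pool[key] = obj
        if len(self._pool) > self._size:
            old_key, old_obj = self._pool.popitem(last=False)  # evict the LRU entry
            self._disk[old_key] = dict(old_obj)  # persist it on the way out

    def flush(self):
        # Persist everything currently in the pool without evicting it.
        for key, obj in self._pool.items():
            self._disk[key] = dict(obj)


disk = {i: {'text': 'hello'} for i in range(5)}
pool = ToyBufferPool(disk, size=2)

pool.get(0)['text'] = 'goodbye'  # modified in memory only
unflushed = disk[0]['text']      # storage still holds the old value
pool.get(1)
pool.get(2)                      # pool is full: key 0 is evicted and written back
after_eviction = disk[0]['text']

stale = pool.get(3)              # keep a manual reference to record 3
pool.get(4)
pool.get(0)                      # record 3 has now been evicted
stale['text'] = 'lost'           # this object is no longer tracked by the pool
pool.flush()
stale_result = disk[3]['text']   # the late modification never reached storage
```

The last four lines mirror the failure mode discussed in this section: once an object has left the pool, changes made through an old reference are silently dropped, whereas fetching the record through the pool again would return a tracked object.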


Warning

The buffer pool ensures that in-memory modified Documents are persisted to disk. Therefore, you should not reference Documents manually and modify them if they might be outside of the buffer pool. The next section explains the best practices when modifying Documents.

### Modify elements

Modifying elements of a DocumentArrayMemmap is possible because accessed and modified Documents are kept in the buffer pool:

from jina import DocumentArrayMemmap, Document

d1 = Document(text='hello')
d2 = Document(text='world')

dam = DocumentArrayMemmap('./my-memmap')
dam.extend([d1, d2])

dam[0].text = 'goodbye'

print(dam[0].text)

goodbye


However, you should not modify Documents that you reference manually and that might no longer be in the buffer pool. Here are some practices to avoid:

1. Keep more references than the buffer pool size and modify them:

from jina import Document, DocumentArrayMemmap

docs = [Document(text='hello') for _ in range(100)]
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
dam.extend(docs)
for doc in docs:
    doc.text = 'goodbye'

dam[50].text

hello


Use the dam object to modify instead:

from jina import Document, DocumentArrayMemmap

docs = [Document(text='hello') for _ in range(100)]
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
dam.extend(docs)
for doc in dam:
    doc.text = 'goodbye'

dam[50].text

goodbye


It’s also okay if you reference fewer Documents than the buffer pool size:

from jina import Document, DocumentArrayMemmap

docs = [Document(text='hello') for _ in range(100)]
dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=1000)
dam.extend(docs)
for doc in docs:
    doc.text = 'goodbye'

dam[50].text

goodbye

2. Modify a reference that might have left the buffer pool:

from jina import Document, DocumentArrayMemmap

dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
my_doc = Document(text='hello')
dam.append(my_doc)

# my_doc leaves the buffer pool after extend
dam.extend([Document(text='hello') for _ in range(99)])
my_doc.text = 'goodbye'
dam[0].text

hello


Get the Document from the dam object and then modify it:

from jina import Document, DocumentArrayMemmap

dam = DocumentArrayMemmap('./my-memmap', buffer_pool_size=10)
my_doc = Document(text='hello')
dam.append(my_doc)

# my_doc leaves the buffer pool after extend
dam.extend([Document(text='hello') for _ in range(99)])
dam[my_doc.id].text = 'goodbye' # or dam[0].text = 'goodbye'
dam[0].text

goodbye


To summarize, it’s a best practice to rely on the dam object to reference the Documents that you modify.

### Maintain consistency

Consider two DocumentArrayMemmap objects that share the same on-disk storage ./my-memmap but live in different processes or threads. Each object keeps its own lookup table and buffer pool in memory, so after write operations through one object, the other object's in-memory state may no longer match what is on disk. Calling .reload() and .flush() resolves this:

from jina import Document, DocumentArrayMemmap

d1 = Document(text='hello')
d2 = Document(text='world')

dam = DocumentArrayMemmap('./my-memmap')
dam2 = DocumentArrayMemmap('./my-memmap')

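dam.extend([d1, d2])
assert len(dam) == 2
assert len(dam2) == 0  # dam2's lookup table is stale

dam2.reload()  # re-read the lookup table from disk
assert len(dam2) == 2

dam.clear()
assert len(dam) == 0
assert len(dam2) == 2  # stale again

dam2.reload()
assert len(dam2) == 0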
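The reason a second object can fall out of date is easier to see in a stripped-down model. The following is a minimal sketch under stated assumptions, not Jina's implementation: `ToyHandle` is a hypothetical class, and the shared storage is a single JSON-lines file named `shared.jsonl`. Each handle builds a private list of record offsets, so a write through one handle is invisible to the other until that one rescans the file.

```python
import json
import os
import tempfile


class ToyHandle:
    """Hypothetical handle on a shared JSON-lines file with a private offset index."""

    def __init__(self, path):
        self._path = path
        open(self._path, 'ab').close()  # create the file if it is missing
        self.reload()

    def reload(self):
        # Rebuild the private index from whatever is on disk right now.
        self._offsets = []
        offset = 0
        with open(self._path, 'rb') as fp:
            for line in fp:
                self._offsets.append(offset)
                offset += len(line)

    def append(self, record):
        line = (json.dumps(record) + '\n').encode('utf-8')
        offset = os.path.getsize(self._path)
        with open(self._path, 'ab') as fp:
            fp.write(line)
        self._offsets.append(offset)  # only this handle's index is updated

    def __len__(self):
        return len(self._offsets)


path = os.path.join(tempfile.mkdtemp(), 'shared.jsonl')
writer = ToyHandle(path)
reader = ToyHandle(path)

writer.append({'text': 'hello'})
stale_len = len(reader)  # reader's index predates the write
reader.reload()
fresh_len = len(reader)  # index now matches the file
```

This toy model has no buffer pool, so it only covers the lookup-table half of the problem; modified records that are still held in memory additionally need to be written out before another handle can see them.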


You don’t need to call .flush() after adding new Documents. However, after modifying an attribute of an existing Document, you need to call it:

from jina import Document, DocumentArrayMemmap

d1 = Document(text='hello')

dam = DocumentArrayMemmap('./my-memmap')
dam2 = DocumentArrayMemmap('./my-memmap')

dam.append(d1)
d1.text = 'goodbye'
assert len(dam) == 1
assert len(dam2) == 0