docarray.memmap package

Submodules

Module contents

class docarray.memmap.DocumentArrayMemmap(path=None, key_length=36, buffer_pool_size=1000)

Bases: docarray.array.mixins.AllMixins, collections.abc.MutableSequence

Create a memory map of a DocumentArray stored in binary files on disk.

Memory-mapped files are used to access individual Documents of a large DocumentArray on disk without reading the entire file into memory.

The DocumentArrayMemmap on-disk storage consists of two files:
  • header.bin: stores the id, offset, length and boundary info of each Document in body.bin;

  • body.bin: stores the Documents contiguously.

When loading DocumentArrayMemmap, it loads the content of header.bin into memory, while storing all body.bin data on disk. As header.bin is often much smaller than body.bin, memory is saved.
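The header/body split can be illustrated with a toy store. The sketch below is not the library's actual file format: the JSON header, the ToyStore name and its methods are assumptions made purely for illustration. It shows how keeping only an (offset, length) pair per id in memory lets a single record be read with one seek, while the body stays on disk.

```python
import json
import os
import tempfile


class ToyStore:
    """Toy header/body store (hypothetical; not docarray's real on-disk format)."""

    def __init__(self, path):
        os.makedirs(path, exist_ok=True)
        self.header_path = os.path.join(path, 'header.json')
        self.body_path = os.path.join(path, 'body.bin')
        # Only the small index lives in memory: id -> (offset, length).
        self.index = {}
        if os.path.exists(self.header_path):
            with open(self.header_path) as f:
                self.index = {k: tuple(v) for k, v in json.load(f).items()}

    def append(self, doc_id, payload: bytes):
        # Records are written back to back at the end of the body file.
        with open(self.body_path, 'ab') as body:
            offset = body.seek(0, os.SEEK_END)
            body.write(payload)
        self.index[doc_id] = (offset, len(payload))
        with open(self.header_path, 'w') as f:
            json.dump(self.index, f)

    def get(self, doc_id) -> bytes:
        # Seek straight to the record instead of reading the whole body.
        offset, length = self.index[doc_id]
        with open(self.body_path, 'rb') as body:
            body.seek(offset)
            return body.read(length)


store = ToyStore(tempfile.mkdtemp())
store.append('a', b'first document')
store.append('b', b'second')
second = store.get('b')
```

Reading 'b' touches only its own bytes in the body file, which is the property that makes the in-memory footprint depend on the header size rather than on the body size.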

DocumentArrayMemmap also loads a portion of the documents into a memory buffer and keeps these in-memory documents synced with the disk. This helps ensure that modified documents are persisted to disk. The size of the memory buffer is set with the buffer_pool_size parameter, which is the number of documents the buffer can hold.

Note

To make sure the documents you modify are persisted to disk, keep the number of documents you reference at the same time within the buffer pool size. Documents beyond that limit are no longer referenced by the buffer pool, so changes to them will not be persisted. The best practice is to always access documents through the DocumentArrayMemmap (DAM) itself.
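The effect described in this note can be reproduced with a toy model of a bounded buffer. This is only a sketch under assumptions: ToyBufferPool, its least-recently-used eviction and its write-back behavior are hypothetical and are not taken from the library's implementation. It shows why a change made through a reference that the pool no longer tracks never reaches the disk.

```python
from collections import OrderedDict


class ToyBufferPool:
    """Toy model of a bounded write-back buffer (hypothetical, for illustration only)."""

    def __init__(self, disk, pool_size):
        self.disk = disk              # stands in for the on-disk body: id -> text
        self.pool_size = pool_size
        self.pool = OrderedDict()     # id -> mutable in-memory document

    def get(self, doc_id):
        if doc_id in self.pool:
            self.pool.move_to_end(doc_id)
            return self.pool[doc_id]
        if len(self.pool) >= self.pool_size:
            # Evict the least recently used document and write back its current state.
            old_id, old_doc = self.pool.popitem(last=False)
            self.disk[old_id] = old_doc['text']
        doc = {'text': self.disk[doc_id]}
        self.pool[doc_id] = doc
        return doc

    def flush(self):
        # Only documents still tracked by the pool are written back.
        for doc_id, doc in self.pool.items():
            self.disk[doc_id] = doc['text']


disk = {'a': 'old', 'b': 'old', 'c': 'old'}
pool = ToyBufferPool(disk, pool_size=2)

a = pool.get('a')
b = pool.get('b')
c = pool.get('c')        # the pool is full, so 'a' is evicted here
a['text'] = 'new'        # 'a' is no longer tracked: this change is lost
c['text'] = 'new'        # 'c' is tracked: this change is persisted
pool.flush()
```

In this model, fetching the document again through the container (pool.get('a')) right before modifying it avoids the lost update, which mirrors the recommendation to access documents through the DocumentArrayMemmap rather than through long-lived references.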

This class is designed to work similarly to DocumentArray but differs in the following aspects:
  • you can set attributes of elements in a DocumentArrayMemmap, but you need to make sure that you don't reference more documents than the buffer pool size - each document

To convert between a DocumentArrayMemmap and a DocumentArray:

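from docarray import DocumentArray
from docarray.memmap import DocumentArrayMemmap

# convert from DocumentArrayMemmap to DocumentArray
dam = DocumentArrayMemmap('./tmp')
...

da = DocumentArray(dam)

# convert from DocumentArray to DocumentArrayMemmap
# (a separate path is used here so the copy does not land in the same store as dam)
dam2 = DocumentArrayMemmap('./tmp2')
dam2.extend(da)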

insert(index, doc)

Insert doc at index.

Parameters
  • index (int) – the position at which to insert doc.

  • doc (Document) – the Document to insert.

Return type

None

reload()

Reload header of this object from the disk.

This function is useful when another thread or process has modified the on-disk storage and the change has not yet been reflected in this DocumentArrayMemmap object.

This function only reloads the header, not the body.
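The situation that reload() addresses can be sketched with a toy reader that caches a header file in memory. Everything here is an assumption for illustration (the ToyHeaderView class, the JSON header, and the second writer simulated in the same process); it is not the library's code. The point is that the cached view stays stale until the header is re-read, and that the body is not involved.

```python
import json
import os
import tempfile


class ToyHeaderView:
    """Toy reader that caches a header file in memory (hypothetical, not docarray code)."""

    def __init__(self, header_path):
        self.header_path = header_path
        self.ids = []
        self.reload()

    def reload(self):
        # Re-read the header from disk; the body is never touched here.
        if os.path.exists(self.header_path):
            with open(self.header_path) as f:
                self.ids = json.load(f)
        else:
            self.ids = []


header_path = os.path.join(tempfile.mkdtemp(), 'header.json')
with open(header_path, 'w') as f:
    json.dump(['a'], f)

view = ToyHeaderView(header_path)

# Another writer (simulated here in the same process) adds an entry on disk.
with open(header_path, 'w') as f:
    json.dump(['a', 'b'], f)

stale = list(view.ids)   # still the old snapshot
view.reload()
fresh = list(view.ids)   # now reflects the on-disk change
```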

extend(docs)

Extend the DocumentArrayMemmap by appending all the items from the iterable.

Parameters

docs (Iterable[Document]) – the iterable of Documents to extend this array with

Return type

None

clear()

Clear the on-disk data of this DocumentArrayMemmap.

Return type

None

append(doc, flush=True, update_buffer=True)

Append doc to the DocumentArrayMemmap.

Parameters
  • doc (Document) – the Document to append.

  • flush (bool) – if set, flush to disk once the append is done.

  • update_buffer (bool) – if set, update the buffer.

Return type

None

flush()

Persist the documents currently loaded in memory to disk.

Return type

None

prune()

Prune deleted Documents from this object. This yields a smaller on-disk storage.

Return type

None
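Why pruning shrinks the storage can be shown with a small compaction sketch. It assumes, as an illustration only, that deleting a record removes its index entry but leaves its bytes in the body until a prune rewrites the body; the compact helper and the in-memory byte string standing in for the body file are hypothetical and not the library's implementation.

```python
def compact(body: bytes, index: dict):
    """Return a new body and index holding only the records still listed in `index`.

    `index` maps id -> (offset, length). Hypothetical helper, not docarray's code.
    """
    new_body = bytearray()
    new_index = {}
    # Copy live records in their original order so offsets stay increasing.
    for doc_id, (offset, length) in sorted(index.items(), key=lambda kv: kv[1][0]):
        new_index[doc_id] = (len(new_body), length)
        new_body += body[offset:offset + length]
    return bytes(new_body), new_index


body = b'AAAABBBCC'
index = {'a': (0, 4), 'b': (4, 3), 'c': (7, 2)}

# Deleting 'b' only drops its index entry; its bytes still occupy space.
del index['b']

pruned_body, pruned_index = compact(body, index)
```

After compaction the body holds only the bytes of 'a' and 'c', and the offsets in the index are recomputed to match the new layout.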

property physical_size: int

Return the on-disk physical size of this DocumentArrayMemmap, in bytes

Return type

int

Returns

the number of bytes

property path: str

Get the path where the instance is stored.

Return type

str

Returns

The stored path of the memmap instance.