jina.types.arrays.memmap module

class jina.types.arrays.memmap.DocumentArrayMemmap(path, key_length=36, buffer_pool_size=1000)[source]

Bases: jina.types.arrays.traversable.TraversableSequence, jina.types.arrays.document.DocumentArrayGetAttrMixin, jina.types.arrays.neural_ops.DocumentArrayNeuralOpsMixin, jina.types.arrays.search_ops.DocumentArraySearchOpsMixin, collections.abc.Iterable, jina.types.arrays.abstract.AbstractDocumentArray

Create a memory-map to a DocumentArray stored in binary files on disk.

Memory-mapped files are used to access the Documents of a large DocumentArray on disk without reading the entire file into memory.

The DocumentArrayMemmap on-disk storage consists of two files:
  • header.bin: stores id, offset, length and boundary info of each Document in body.bin;

  • body.bin: stores the Documents contiguously.

When loading DocumentArrayMemmap, it loads the content of header.bin into memory, while storing all body.bin data on disk. As header.bin is often much smaller than body.bin, memory is saved.
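The header/body split described above can be illustrated with a small standalone sketch. It does not use Jina and does not reproduce Jina's actual file format: the record layout (a fixed-width id plus an offset and a length) and the helper names `append_record`, `load_header` and `read_record` are assumptions made for this example only.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical header record for this sketch: a 36-byte id followed by two
# unsigned 64-bit integers (offset, length). Jina's real layout may differ.
KEY_LENGTH = 36
HEADER_RECORD = struct.Struct(f'<{KEY_LENGTH}sQQ')


def append_record(folder, doc_id, payload):
    """Append payload to body.bin and its (id, offset, length) to header.bin."""
    with open(os.path.join(folder, 'body.bin'), 'ab') as body:
        body.seek(0, os.SEEK_END)
        offset = body.tell()
        body.write(payload)
    with open(os.path.join(folder, 'header.bin'), 'ab') as header:
        header.write(HEADER_RECORD.pack(doc_id.encode(), offset, len(payload)))


def load_header(folder):
    """Read the whole (small) header into a dict: id -> (offset, length)."""
    with open(os.path.join(folder, 'header.bin'), 'rb') as header:
        data = header.read()
    index = {}
    for raw_id, offset, length in HEADER_RECORD.iter_unpack(data):
        index[raw_id.rstrip(b'\x00').decode()] = (offset, length)
    return index


def read_record(folder, index, doc_id):
    """Fetch one payload through a memory map, without loading all of body.bin."""
    offset, length = index[doc_id]
    with open(os.path.join(folder, 'body.bin'), 'rb') as body:
        with mmap.mmap(body.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


folder = tempfile.mkdtemp()
append_record(folder, 'doc-1', b'hello')
append_record(folder, 'doc-2', b'memory-mapped world')

index = load_header(folder)            # only the header lives in memory
second = read_record(folder, index, 'doc-2')
```

In this toy version only `index` is held in memory; each payload is sliced out of the mapped body file on demand, which is the memory-saving idea the paragraph describes.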

DocumentArrayMemmap also loads a portion of the documents in a memory buffer and keeps the memory documents synced with the disk. This helps ensure that modified documents are persisted to the disk. The memory buffer size is configured with parameter buffer_pool_size which represents the number of documents that the buffer can store.

Note

To ensure that the documents you modify are persisted to disk, keep the number of referenced documents at or below the buffer pool size. Documents beyond that limit are no longer referenced by the buffer pool, so changes to them will not be persisted. The best practice is to always reference documents through the DocumentArrayMemmap (DAM).
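The following sketch is a simplified model of why exceeding the buffer pool size can lose modifications. It is not Jina's implementation: the `ToyBufferPool` class, the oldest-first eviction policy, and the use of plain dicts as documents and as "disk" are all assumptions chosen to keep the example short.

```python
from collections import OrderedDict


class ToyBufferPool:
    """Hypothetical bounded pool that tracks at most `capacity` in-memory docs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tracked = OrderedDict()  # doc id -> in-memory doc (a dict here)

    def track(self, doc_id, doc):
        self.tracked[doc_id] = doc
        self.tracked.move_to_end(doc_id)
        if len(self.tracked) > self.capacity:
            # The oldest doc leaves the pool; later edits to it go unnoticed.
            self.tracked.popitem(last=False)

    def save(self, disk):
        """Write every doc that is still tracked back to `disk`."""
        for doc_id, doc in self.tracked.items():
            disk[doc_id] = dict(doc)


disk = {f'd{i}': {'text': 'old'} for i in range(3)}
pool = ToyBufferPool(capacity=2)

# Reference three docs while the pool can only hold two.
refs = {}
for doc_id in disk:
    refs[doc_id] = dict(disk[doc_id])  # in-memory copy, as if loaded from disk
    pool.track(doc_id, refs[doc_id])

# Modify all three in-memory docs, then persist what the pool still tracks.
for doc in refs.values():
    doc['text'] = 'new'
pool.save(disk)
```

After `save`, the two documents still tracked by the pool carry the new text on "disk", while the evicted one keeps its old content. This mirrors the situation the note warns about.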

This class is designed to work similarly to DocumentArray but differs in the following aspects:
  • you can set the attribute of elements in a DocumentArrayMemmap but you need to make sure that you don’t reference more documents than the buffer pool size - each document

To convert between a DocumentArrayMemmap and a DocumentArray:

from jina import DocumentArray
from jina.types.arrays.memmap import DocumentArrayMemmap

# convert from DocumentArrayMemmap to DocumentArray
dam = DocumentArrayMemmap('./tmp')
...

da = DocumentArray(dam)

# convert from DocumentArray to DocumentArrayMemmap
dam2 = DocumentArrayMemmap('./tmp')
dam2.extend(da)

reload()[source]

Reload header of this object from the disk.

This function is useful when another thread or process modifies the on-disk storage and the change has not yet been reflected in this DocumentArrayMemmap object.

This function only reloads the header, not the body.
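As an illustration of the stale-header situation, here is a minimal standalone sketch. It models the header as a text file with one id per line; `ToyReader` and that file format are hypothetical and only stand in for the cached header described above.

```python
import os
import tempfile


class ToyReader:
    """Hypothetical reader that caches a header (one id per line) in memory."""

    def __init__(self, header_path):
        self.header_path = header_path
        self.ids = []
        self.reload()

    def reload(self):
        """Re-read only the header file; document bodies are not touched."""
        with open(self.header_path) as f:
            self.ids = f.read().split()


header_path = os.path.join(tempfile.mkdtemp(), 'header.txt')
with open(header_path, 'w') as f:
    f.write('a\nb\n')

reader = ToyReader(header_path)

# Simulate another process appending an entry behind the reader's back.
with open(header_path, 'a') as f:
    f.write('c\n')

stale = list(reader.ids)   # the cached view does not know about 'c' yet
reader.reload()
fresh = list(reader.ids)   # after reloading, the new entry is visible
```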

extend(values)[source]

Extend the DocumentArrayMemmap by appending all the items from the iterable.

Parameters

values (Iterable[Document]) – the iterable of Documents to extend this array with

Return type

None

clear()[source]

Clear the on-disk data of this DocumentArrayMemmap.

Return type

None

append(doc, flush=True, update_buffer=True)[source]

Append doc to the DocumentArrayMemmap.

Parameters
  • doc (Document) – The Document to append.

  • update_buffer (bool) – If set, update the buffer.

  • flush (bool) – If set, flush to disk once the append is done.

Return type

None

get_doc_by_key(key)[source]

Return a Document by key (ID) from disk.

Parameters

key (str) – id of the Document

Returns

the Document with the given key

save()[source]

Persist the documents currently loaded in memory to disk.

Return type

None

prune()[source]

Prune deleted Documents from this object. This yields a smaller on-disk storage.

Return type

None
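The idea behind pruning can be sketched as a compaction step. The example below assumes, for illustration only, that deleting a Document removes its header entry but leaves its bytes in the body; the standalone `prune` function here is a toy helper and not the method's actual implementation.

```python
def prune(body, header):
    """Return a compacted (body, header) that keeps only live records.

    body   -- bytes holding all records back to back, including deleted ones
    header -- dict mapping live doc ids to (offset, length) within `body`
    """
    new_body = bytearray()
    new_header = {}
    for doc_id, (offset, length) in header.items():
        # The record moves to the current end of the compacted body.
        new_header[doc_id] = (len(new_body), length)
        new_body += body[offset:offset + length]
    return bytes(new_body), new_header


# Three records were written, then 'b' was dropped from the header only.
body = b'AAAA' + b'BBBBBB' + b'CC'
header = {'a': (0, 4), 'c': (10, 2)}

small_body, small_header = prune(body, header)
```

The compacted body no longer contains the bytes of the deleted record, and the remaining offsets are rewritten to match.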

property physical_size: int

Return the on-disk physical size of this DocumentArrayMemmap, in bytes.

Return type

int

Returns

the number of bytes

get_attributes(*fields)[source]

Return all nonempty values of the fields from all docs this array contains

Parameters

fields (str) – Variable length argument with the name of the fields to extract

Return type

Union[List, List[List]]

Returns

Returns a list of the values for these fields. When more than one field is given, it returns a list of lists.

get_attributes_with_docs(*fields)[source]

Return all nonempty values of the fields together with their nonempty docs

Parameters

fields (str) – Variable length argument with the name of the fields to extract

Return type

Tuple[Union[List, List[List]], DocumentArray]

Returns

Returns a tuple. The first element is a list of the values for these fields; when more than one field is given, it is a list of lists. The second element is the non-empty docs.

property embeddings: numpy.ndarray

Return a np.ndarray stacking all the embedding attributes as rows.

Return type

ndarray

Returns

embeddings stacked per row as np.ndarray.

Warning

This operation assumes all embeddings have the same shape and dtype. All dtype and shape values are assumed to be equal to the values of the first element in the DocumentArray / DocumentArrayMemmap.

Warning

This operation currently does not support sparse arrays.

property tags: List[jina.types.struct.StructView]

Get the tags attribute of all Documents

Return type

List[StructView]

Returns

List of tags attributes for all Documents

property texts: List[str]

Get the text attribute of all Documents

Return type

List[str]

Returns

List of text attributes for all Documents

property buffers: List[bytes]

Get the buffer attribute of all Documents

Return type

List[bytes]

Returns

List of buffer attributes for all Documents

property blobs: numpy.ndarray

Return a np.ndarray stacking all the blob attributes.

The blob attributes are stacked together along a newly created first dimension (as if you would stack using np.stack(X, axis=0)).

Warning

This operation assumes all blobs have the same shape and dtype. All dtype and shape values are assumed to be equal to the values of the first element in the DocumentArray / DocumentArrayMemmap.

Warning

This operation currently does not support sparse arrays.

Return type

ndarray

Returns

blobs stacked per row as np.ndarray.

property path: str

Get the path where the instance is stored.

Return type

str

Returns

The stored path of the memmap instance.