jina.types.document.generators module

jina.types.document.generators.from_ndarray(array, axis=0, size=None, shuffle=False)[source]

Create a generator for a given dimension of a numpy array.

Parameters
  • array (np.ndarray) – the numpy ndarray data source

  • axis (int) – iterate over that axis

  • size (Optional[int]) – the maximum number of sub-arrays to yield

  • shuffle (bool) – shuffle the numpy data source beforehand

Yield

ndarray

Return type

Generator[ForwardRef, None, None]
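The axis iteration described above can be sketched in plain NumPy. This is a stand-in written for illustration, not the library's implementation: the helper name iter_axis is hypothetical, and shuffling a copy along the first axis is an assumption about what "beforehand" means.

```python
from typing import Iterator, Optional

import numpy as np


def iter_axis(
    array: np.ndarray,
    axis: int = 0,
    size: Optional[int] = None,
    shuffle: bool = False,
) -> Iterator[np.ndarray]:
    """Yield sub-arrays of `array` along `axis` (hypothetical helper)."""
    if shuffle:
        # Assumption: shuffle rows of a copy, leaving the caller's array untouched.
        array = np.random.permutation(array)
    # Moving the chosen axis to the front lets a plain loop walk over it.
    for idx, sub in enumerate(np.moveaxis(array, axis, 0)):
        if size is not None and idx >= size:
            break
        yield sub
```

With a 2x3 array, axis=1 produces three sub-arrays of length 2, and size=2 stops after the first two.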

jina.types.document.generators.from_files(patterns, recursive=True, size=None, sampling_rate=None, read_mode=None)[source]

Creates an iterator over file paths or over the content of the files.

Parameters
  • patterns (Union[str, List[str]]) – The pattern may contain simple shell-style wildcards, e.g. ‘*.py’, ‘[*.zip, *.gz]’

  • recursive (bool) – If recursive is true, the pattern ‘**’ matches any files and zero or more directories and subdirectories

  • size (Optional[int]) – the maximum number of files

  • sampling_rate (Optional[float]) – the sampling rate between [0, 1]

  • read_mode (Optional[str]) – specifies the mode in which the file is opened: ‘r’ for reading in text mode, ‘rb’ for reading in binary mode. If read_mode is None, the generator iterates over file names instead of file content.

Yield

file paths, or file content read according to read_mode

Note

This function should not be used directly. Use Flow.index_files() or Flow.search_files() instead.

Return type

Generator[ForwardRef, None, None]
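The pattern matching and read_mode behaviour described above can be approximated with the standard library. This is a minimal sketch under stated assumptions, not the library's code: iter_files is a hypothetical name, sampling_rate is left out, and directories that match a pattern are skipped.

```python
import glob
import itertools
import os
from typing import Iterator, List, Optional, Union


def iter_files(
    patterns: Union[str, List[str]],
    recursive: bool = True,
    size: Optional[int] = None,
    read_mode: Optional[str] = None,
) -> Iterator[Union[str, bytes]]:
    """Yield matching file paths, or file content if read_mode is set."""
    if read_mode not in (None, 'r', 'rb'):
        raise ValueError(f'unsupported read_mode: {read_mode!r}')
    if isinstance(patterns, str):
        patterns = [patterns]
    # With recursive=True, '**' in a pattern also matches nested directories.
    matches = itertools.chain.from_iterable(
        glob.iglob(p, recursive=recursive) for p in patterns
    )
    # Assumption: only regular files are of interest.
    files = (m for m in matches if os.path.isfile(m))
    for path in itertools.islice(files, size):
        if read_mode is None:
            yield path
        else:
            with open(path, read_mode) as f:
                yield f.read()
```

Passing read_mode=None keeps the generator cheap, since no file is opened until the caller decides to read it.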

jina.types.document.generators.from_csv(fp, field_resolver=None, size=None, sampling_rate=None)[source]

Generator function for CSV input. Yields documents.

Parameters
  • fp (Iterable[str]) – file paths

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents

  • sampling_rate (Optional[float]) – the sampling rate, in the range [0, 1]

Yield

documents

Return type

Generator[ForwardRef, None, None]
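To illustrate how field_resolver, size and sampling_rate interact, here is a small stand-in built on csv.DictReader. It is a sketch only: iter_csv and subsample are hypothetical names, plain dicts take the place of documents, and applying size before sampling is an assumption.

```python
import csv
import io
import itertools
import random
from typing import Any, Dict, Iterable, Iterator, Optional


def subsample(items: Iterable[Any], size: Optional[int] = None,
              sampling_rate: Optional[float] = None) -> Iterator[Any]:
    """Cap the input at `size` items, then keep each with probability `sampling_rate`."""
    for item in itertools.islice(items, size):
        if sampling_rate is None or random.random() < sampling_rate:
            yield item


def iter_csv(fp: Iterable[str],
             field_resolver: Optional[Dict[str, str]] = None,
             size: Optional[int] = None,
             sampling_rate: Optional[float] = None) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, renaming columns via field_resolver."""
    resolver = field_resolver or {}
    for row in subsample(csv.DictReader(fp), size, sampling_rate):
        # Columns missing from the resolver keep their original name.
        yield {resolver.get(key, key): value for key, value in row.items()}


data = io.StringIO('title,body\nA,first\nB,second\n')
records = list(iter_csv(data, field_resolver={'title': 'text'}))
```

Here records holds two dicts whose title column has been renamed to text, while body is passed through unchanged.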

jina.types.document.generators.from_ndjson(fp, field_resolver=None, size=None, sampling_rate=None)[source]

Generator function for line-separated JSON (NDJSON). Yields documents.

Parameters
  • fp (Iterable[str]) – file paths

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents

  • sampling_rate (Optional[float]) – the sampling rate, in the range [0, 1]

Yield

documents

Return type

Generator[ForwardRef, None, None]
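Line-separated JSON can be parsed one record at a time, which is what makes a generator a natural fit. The following is an illustrative sketch rather than the library's implementation: iter_ndjson is a hypothetical name, plain dicts stand in for documents, sampling is omitted, and skipping blank lines is an assumption.

```python
import io
import json
from typing import Any, Dict, Iterable, Iterator, Optional


def iter_ndjson(fp: Iterable[str],
                field_resolver: Optional[Dict[str, str]] = None,
                size: Optional[int] = None) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line, renaming keys via field_resolver."""
    resolver = field_resolver or {}
    count = 0
    for line in fp:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no record
        if size is not None and count >= size:
            break
        record = json.loads(line)
        yield {resolver.get(key, key): value for key, value in record.items()}
        count += 1


stream = io.StringIO('{"content": "hello"}\n\n{"content": "world"}\n')
docs = list(iter_ndjson(stream, field_resolver={'content': 'text'}))
```

Because each line is decoded independently, the whole input never has to fit in memory at once.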

jina.types.document.generators.from_lines(lines=None, filepath=None, read_mode='r', line_format='json', field_resolver=None, size=None, sampling_rate=None)[source]

Generator function for lines, JSON and CSV. Yields documents or strings.

Parameters
  • lines (Optional[Iterable[str]]) – a list of strings, each of which is treated as a document

  • filepath (Optional[str]) – a text file in which each line contains a document

  • read_mode (str) – specifies the mode in which the file is opened: ‘r’ for reading in text mode, ‘rb’ for reading in binary mode

  • line_format (str) – the format of each line: ‘json’ or ‘csv’

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents

  • sampling_rate (Optional[float]) – the sampling rate, in the range [0, 1]

Yield

documents

Return type

Generator[ForwardRef, None, None]
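The role of line_format can be shown with a small dispatcher over in-memory lines. This is a sketch under stated assumptions, not the library's code: iter_lines is a hypothetical name, the filepath, field_resolver and sampling_rate parameters are left out, and treating any other line_format as "yield the raw strings" is an assumption.

```python
import csv
import json
from typing import Any, Iterable, Iterator, Optional


def iter_lines(lines: Iterable[str],
               line_format: str = 'json',
               size: Optional[int] = None) -> Iterator[Any]:
    """Parse each line according to line_format and yield the results."""
    if line_format == 'json':
        parsed = (json.loads(line) for line in lines if line.strip())
    elif line_format == 'csv':
        # The first line is taken as the header row.
        parsed = csv.DictReader(lines)
    else:
        # Assumption: unknown formats fall back to plain strings.
        parsed = iter(lines)
    for idx, item in enumerate(parsed):
        if size is not None and idx >= size:
            break
        yield item
```

For example, iter_lines(['text,id', 'a,1'], line_format='csv') yields a single dict keyed by the header row.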