jina.types.document.generators module

jina.types.document.generators.from_ndarray(array, axis=0, size=None, shuffle=False)[source]

Create a generator for a given dimension of a numpy array.

Parameters
  • array (np.ndarray) – the numpy ndarray data source

  • axis (int) – the axis to iterate over

  • size (Optional[int]) – the maximum number of sub-arrays to yield

  • shuffle (bool) – shuffle the numpy data source beforehand

Yield

documents

Return type

Generator[Document, None, None]
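
The axis, size and shuffle parameters can be illustrated without Jina or NumPy. The sketch below is a hypothetical stand-in: iter_axis and its seed argument are not part of the library, it works on a 2-D nested list, and it yields plain lists rather than documents. It assumes, as the description of shuffle suggests, that shuffling happens before the size limit is applied.

```python
import random
from typing import Iterator, List, Optional, Sequence


def iter_axis(
    rows: Sequence[Sequence[float]],
    axis: int = 0,
    size: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,  # hypothetical extra, only for reproducible shuffling
) -> Iterator[List[float]]:
    """Yield the slices of a 2-D nested list along `axis` (0 = rows, 1 = columns)."""
    if axis == 0:
        slices = [list(r) for r in rows]
    elif axis == 1:
        slices = [list(c) for c in zip(*rows)]
    else:
        raise ValueError("this sketch only handles 2-D data (axis 0 or 1)")
    if shuffle:
        # Assumption: the source is shuffled before the size limit applies.
        random.Random(seed).shuffle(slices)
    # Stop after `size` slices; None means no limit.
    yield from slices[:size]


data = [[1, 2, 3], [4, 5, 6]]
rows_out = list(iter_axis(data, axis=0))          # one item per row
cols_out = list(iter_axis(data, axis=1, size=2))  # first two columns only
```

With axis=0 each yielded item is a row; with axis=1 each item is a column, and size=2 stops after the second column.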

jina.types.document.generators.from_files(patterns, recursive=True, size=None, sampling_rate=None, read_mode=None, to_dataturi=False)[source]

Creates an iterator over a list of file paths or over the contents of the files.

Parameters
  • patterns (Union[str, List[str]]) – a pattern or a list of patterns; each may contain simple shell-style wildcards, e.g. ‘*.py’ or [‘*.zip’, ‘*.gz’]

  • recursive (bool) – If recursive is true, the pattern ‘**’ will match any files and zero or more directories and subdirectories

  • size (Optional[int]) – the maximum number of files

  • sampling_rate (Optional[float]) – the sampling rate, in the range [0, 1]

  • read_mode (Optional[str]) – the mode in which each file is opened: ‘r’ for text mode, ‘rb’ for binary mode. If read_mode is None, the generator iterates over file names instead of file contents.

  • to_dataturi (bool) – if set, Document.uri is filled with a data URI instead of the plain URI

Yield

file paths or binary content

Note

This function should not be used directly; use Flow.index_files() or Flow.search_files() instead.

Return type

Generator[Document, None, None]
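
How patterns, recursive and read_mode interact can be sketched with the standard library alone. The helper below is hypothetical (iter_files is not the library function): it yields raw paths or file contents rather than documents, and it leaves out sampling_rate and to_dataturi.

```python
import glob
import itertools
import os
from typing import Iterator, List, Optional, Union


def iter_files(
    patterns: Union[str, List[str]],
    recursive: bool = True,
    size: Optional[int] = None,
    read_mode: Optional[str] = None,
) -> Iterator[Union[str, bytes]]:
    """Yield matching file paths, or file contents when `read_mode` is set."""
    if isinstance(patterns, str):
        patterns = [patterns]
    # With recursive=True, '**' matches zero or more nested directories.
    matches = itertools.chain.from_iterable(
        glob.iglob(p, recursive=recursive) for p in patterns
    )
    yielded = 0
    for path in matches:
        if size is not None and yielded >= size:
            break
        if not os.path.isfile(path):
            continue  # skip directories matched by the pattern
        if read_mode is None:
            yield path  # file names only
        else:
            with open(path, read_mode) as fh:
                yield fh.read()  # str for 'r', bytes for 'rb'
        yielded += 1
```

For example, iter_files(os.path.join(root, '**', '*.py')) would walk every Python file under a directory root, while adding read_mode='rb' would yield each file's bytes instead of its path.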

jina.types.document.generators.from_csv(file, field_resolver=None, size=None, sampling_rate=None, dialect='excel')[source]

Generator function for CSV. Yields documents.

Parameters
  • file (Union[str, TextIO]) – a file path or an open file handle

  • field_resolver (Optional[Dict[str, str]]) – a map from the field names defined in the source (JSON, dict) to the field names defined in Document.

  • size (Optional[int]) – the maximum number of documents

  • sampling_rate (Optional[float]) – the sampling rate, in the range [0, 1]

  • dialect (Union[str, csv.Dialect]) – the set of parameters specific to a particular CSV dialect. This can be a string naming one of the dialects predefined on your system, or a csv.Dialect class that groups specific formatting parameters together. If you do not know the dialect and the default does not work, try setting it to ‘auto’.

Yield

documents

Return type

Generator[Document, None, None]
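
The field_resolver and dialect parameters are easiest to see on a small in-memory file. The following is a minimal sketch using only the standard csv module; iter_csv_rows is a hypothetical helper that yields dicts rather than documents, and treating ‘auto’ as "sniff the dialect from the first 1024 characters" is an assumption about what that setting means.

```python
import csv
import io
from typing import Dict, Iterator, Optional, TextIO, Type, Union


def iter_csv_rows(
    fp: TextIO,
    field_resolver: Optional[Dict[str, str]] = None,
    size: Optional[int] = None,
    dialect: Union[str, Type[csv.Dialect]] = "excel",
) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row, renaming columns via `field_resolver`."""
    if dialect == "auto":
        try:
            # Assumption: 'auto' means guessing the dialect from a sample.
            dialect = csv.Sniffer().sniff(fp.read(1024))
        except csv.Error:
            dialect = "excel"  # fall back to the default dialect
        fp.seek(0)
    resolver = field_resolver or {}
    for index, row in enumerate(csv.DictReader(fp, dialect=dialect)):
        if size is not None and index >= size:
            break
        # Columns without an entry in the resolver keep their original name.
        yield {resolver.get(key, key): value for key, value in row.items()}


semicolon_csv = io.StringIO("title;body\nhello;world\nfoo;bar\n")
rows = list(iter_csv_rows(semicolon_csv, field_resolver={"body": "text"}, dialect="auto"))
```

Here the semicolon delimiter is detected by the sniffer, and the source column body is exposed under the target name text.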

jina.types.document.generators.from_huggingface_datasets(dataset_path, field_resolver=None, size=None, sampling_rate=None, filter_fields=False, **datasets_kwargs)[source]

Generator function for Hugging Face Datasets. Yields documents.

This function helps to load datasets from the Hugging Face Datasets Hub (https://huggingface.co/datasets) in Jina. Additional parameters can be passed to the datasets library using keyword arguments. The load_dataset method from the datasets library is used to load the datasets.

Parameters
  • dataset_path (str) – a valid dataset path for the Hugging Face Datasets library.

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents

  • sampling_rate (Optional[float]) – the sampling rate, in the range [0, 1]

  • filter_fields (bool) – specifies whether to filter the dataset down to the fields given in the field_resolver argument.

  • **datasets_kwargs

    additional arguments for the load_dataset method of the datasets library. More details at https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset

Yield

documents

Return type

Generator[Document, None, None]
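
The interplay of field_resolver and filter_fields can be sketched on plain dicts, without downloading a dataset. The helper below is hypothetical and reflects one reading of the parameter description: with filter_fields=True, only the columns named in field_resolver survive. The column names are made up for illustration.

```python
from typing import Any, Dict, Iterable, Iterator, Optional


def resolve_fields(
    records: Iterable[Dict[str, Any]],
    field_resolver: Optional[Dict[str, str]] = None,
    filter_fields: bool = False,
) -> Iterator[Dict[str, Any]]:
    """Rename record fields and optionally drop the ones that are not mapped."""
    resolver = field_resolver or {}
    for record in records:
        if filter_fields:
            # Assumption: keep only the source fields listed in the resolver.
            record = {k: v for k, v in record.items() if k in resolver}
        yield {resolver.get(k, k): v for k, v in record.items()}


sample = [{"sentence": "hi", "label": 1, "idx": 0}]  # made-up dataset rows
kept_all = list(resolve_fields(sample, {"sentence": "text"}))
kept_mapped = list(resolve_fields(sample, {"sentence": "text"}, filter_fields=True))
```

Without filtering, unmapped columns pass through under their original names; with filtering, each record is reduced to the mapped fields only.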

jina.types.document.generators.from_ndjson(fp, field_resolver=None, size=None, sampling_rate=None)[source]

Generator function for line-separated JSON. Yields documents.

Parameters
  • fp (Iterable[str]) – an iterable of JSON-encoded lines, such as an open file handle

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents

  • sampling_rate (Optional[float]) – the sampling rate, in the range [0, 1]

Yield

documents

Return type

Generator[Document, None, None]
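
The size and sampling_rate parameters shared by these generators can be sketched as follows. This is a hypothetical helper, not the library code, and it yields parsed dicts rather than documents. It assumes that size caps the number of lines read and that each of those lines is then kept with probability sampling_rate; the library may combine the two differently.

```python
import json
import random
from typing import Any, Dict, Iterable, Iterator, Optional


def iter_ndjson(
    fp: Iterable[str],
    size: Optional[int] = None,
    sampling_rate: Optional[float] = None,
) -> Iterator[Dict[str, Any]]:
    """Parse one JSON object per line, with an optional cap and random sampling."""
    for index, line in enumerate(fp):
        if size is not None and index >= size:
            break  # assumption: `size` counts lines read, not lines yielded
        # random.random() is in [0, 1), so a rate of 1.0 keeps every line
        # and a rate of 0.0 keeps none.
        if sampling_rate is None or random.random() < sampling_rate:
            yield json.loads(line)


lines = ['{"text": "a"}\n', '{"text": "b"}\n', '{"text": "c"}\n']
first_two = list(iter_ndjson(lines, size=2))
```

Because fp only needs to be an iterable of strings, an open file handle and an in-memory list of lines are handled the same way.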

jina.types.document.generators.from_lines(lines=None, filepath=None, read_mode='r', line_format='json', field_resolver=None, size=None, sampling_rate=None)[source]

Generator function for plain lines, JSON and CSV. Yields documents or strings.

Parameters
  • lines (Optional[Iterable[str]]) – a list of strings, each of which is treated as a document

  • filepath (Optional[str]) – a text file in which each line contains a document

  • read_mode (str) – the mode in which the file is opened: ‘r’ for text mode, ‘rb’ for binary mode

  • line_format (str) – the format of each line: json or csv

  • field_resolver (Optional[Dict[str, str]]) – a map from field names defined in document (JSON, dict) to the field names defined in Protobuf. This is only used when the given document is a JSON string or a Python dict.

  • size (Optional[int]) – the maximum number of documents

  • sampling_rate (Optional[float]) – the sampling rate, in the range [0, 1]

Yield

documents

Return type

Generator[Document, None, None]
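
The role of line_format can be sketched as a small dispatcher over an iterable of lines. This is a hypothetical illustration only: parse_lines is not the library function, it covers the lines argument but not filepath, and it yields dicts or raw strings rather than documents. Passing any format other than json or csv through unchanged is an assumption based on the note that strings may be yielded.

```python
import csv
import json
from typing import Any, Iterable, Iterator


def parse_lines(lines: Iterable[str], line_format: str = "json") -> Iterator[Any]:
    """Dispatch on `line_format`: parse JSON lines, CSV rows, or pass lines through."""
    if line_format == "json":
        for line in lines:
            yield json.loads(line)  # one JSON object per line
    elif line_format == "csv":
        # The first line is taken as the header row.
        yield from csv.DictReader(lines)
    else:
        # Assumption: any other format yields the raw strings unchanged.
        yield from lines


json_docs = list(parse_lines(['{"text": "a"}'], line_format="json"))
csv_docs = list(parse_lines(["id,text", "1,a"], line_format="csv"))
raw_docs = list(parse_lines(["plain text"], line_format="txt"))
```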