Prevent Indexing Duplicates
When indexing documents, the search system commonly receives duplicates. One can either remove the duplicates before sending the documents to Jina or let Jina handle them.
To prevent indexing duplicates, set the uses_before option to _unique. For example:
Python API

from jina.flow import Flow
from jina.proto import jina_pb2

# Build two distinct documents; doc_0 is sent twice below.
doc_0 = jina_pb2.Document()
doc_0.text = 'I am doc0'
doc_1 = jina_pb2.Document()
doc_1.text = 'I am doc1'


def assert_num_docs(rsp, num_docs):
    assert len(rsp.IndexRequest.docs) == num_docs


# '_unique' runs before the indexer and filters out already-indexed documents.
f = Flow().add(
    uses='NumpyIndexer', uses_before='_unique')

with f:
    # Three documents are sent, but only two unique ones should remain.
    f.index(
        [doc_0, doc_0, doc_1],
        output_fn=lambda rsp: assert_num_docs(rsp, num_docs=2))
Under the hood, Jina uses the YAML configuration file executors._unique.yml under jina/resources. The file is defined as follows:
YAML spec

!DocIDCache
with:
  index_path: cache.tmp
requests:
  on:
    [SearchRequest, TrainRequest, IndexRequest, ControlRequest]:
      - !RouteDriver {}
    IndexRequest:
      - !TaggingCacheDriver
        with:
          tags:
            is_indexed: true
      - !FilterQL
        with:
          lookups: {tags__is_indexed__neq: true}
jina.executors.indexers.cache.DocIDCache uses the document ID to detect duplicates: documents with the same ID are considered the same document. jina.drivers.cache.TaggingCacheDriver keeps a set of the indexed keys and checks each document against that cache for a hit. If the document ID exists, jina.drivers.cache.TaggingCacheDriver sets the customized keys in the tags field to the predefined values. In the above configuration, is_indexed in the tags field is set to true when the document ID hits the cached indexed keys. Afterwards, jina.drivers.querylang.filter.FilterQL filters the duplicate documents out of the request.
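The tag-then-filter flow described above can be sketched in plain Python. This is only an illustration of the logic on dictionaries, not Jina's implementation: the helper name tag_and_filter, the dict layout, and the in-memory set standing in for the cache are all assumptions made for this example.

```python
from typing import Dict, List, Set


def tag_and_filter(docs: List[Dict], seen_ids: Set[str]) -> List[Dict]:
    """Hypothetical helper mimicking the two driver steps on plain dicts."""
    # Step 1 (the TaggingCacheDriver role): tag cache hits, remember new IDs.
    for doc in docs:
        if doc['id'] in seen_ids:
            doc.setdefault('tags', {})['is_indexed'] = True
        else:
            seen_ids.add(doc['id'])
    # Step 2 (the FilterQL role): keep documents whose is_indexed tag is not
    # true, which corresponds to the lookup tags__is_indexed__neq: true.
    return [d for d in docs if d.get('tags', {}).get('is_indexed') is not True]


cache: Set[str] = set()  # stands in for the persisted ID cache
batch = [
    {'id': 'aa', 'text': 'I am doc0'},
    {'id': 'aa', 'text': 'I am doc0'},  # duplicate ID, gets tagged and dropped
    {'id': 'bb', 'text': 'I am doc1'},
]
kept = tag_and_filter(batch, cache)
```

Because the cache persists between calls, sending the same IDs again in a later request would leave nothing to index.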
In Jina, the document ID is by default a new hexdigest generated from the content of the document. The hexdigest is calculated with the blake2b algorithm. By setting override_doc_id=True, users can also use customized document IDs with the Jina client and add tags to map to their unique concepts.
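To see why identical content yields identical IDs, here is a minimal sketch of a content-based hexdigest using blake2b from the Python standard library. The helper name content_hexdigest and the digest_size of 8 bytes are assumptions for illustration; which fields of the document Jina actually hashes, and with which parameters, is not specified here.

```python
import hashlib


def content_hexdigest(content: bytes, digest_size: int = 8) -> str:
    """Hypothetical helper: blake2b hexdigest of raw content bytes."""
    # digest_size is an assumed value; the hex string has 2 * digest_size chars.
    return hashlib.blake2b(content, digest_size=digest_size).hexdigest()


id_a = content_hexdigest(b'I am doc0')
id_b = content_hexdigest(b'I am doc0')  # same content as id_a
id_c = content_hexdigest(b'I am doc1')  # different content
```

Under these assumptions, id_a and id_b are equal while id_c differs, which is what lets an ID-based cache detect the repeated document.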
Warning
When setting override_doc_id=True, a customized ID is only acceptable if:
it is a hexadecimal string
it has an even length
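The two conditions above can be checked before sending documents. The following is a minimal sketch; is_acceptable_doc_id is a hypothetical helper, not part of Jina, and rejecting the empty string is an extra assumption.

```python
import string


def is_acceptable_doc_id(doc_id: str) -> bool:
    """Hypothetical check: hexadecimal characters only, and an even length."""
    if not doc_id or len(doc_id) % 2 != 0:
        # Empty (assumed invalid) or odd-length IDs are rejected.
        return False
    return all(ch in string.hexdigits for ch in doc_id)


ok = is_acceptable_doc_id('ab12')
odd_length = is_acceptable_doc_id('abc')
not_hex = is_acceptable_doc_id('xyz1')
```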
Warning
Be careful when using the _unique keyword as a cache executor. It does not set a workspace for storing its data, so it uses the folder where it runs as the workspace. That folder may not be where the actual indexers store their data, which can be inconvenient. To store the cache in a specific workspace while keeping the same functionality, copy the YAML description from jina/resources/executors._unique.yml and add the desired workspace under metas:
!DocIDCache
with:
  index_path: cache.tmp
metas:
  name: cache
  workspace: $WORKSPACE
...