Using Sparse Embeddings in Jina

Motivation

A sparse matrix is a matrix in which the number of zero elements is much higher than the number of non-zero elements. Storing only the non-zero elements significantly reduces both the space needed to represent the data and the time needed to scan the matrix. This guide shows how to use sparse matrices in Jina.

Before you start

Before you begin, make sure you have installed Jina with the Hub components on your machine:

pip install "jina[hub]"

How does Jina handle sparse matrices?

As a search framework, Jina does not have native sparse matrix support. The Ndarray.sparse module in Jina’s Primitive Types is an adapter between Jina and sparse backends such as Scipy.sparse. Jina supports three backends for creating a sparse matrix or tensor: Scipy, TensorFlow and Pytorch. You might have noticed that Scipy.sparse supports many sparse formats; of these, Jina supports only COO, BSR, CSR and CSC.

When creating your own sparse matrix, we suggest using CSR as the default matrix type.
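To make the CSR recommendation more concrete, the following is a minimal pure-Python sketch of how the CSR layout stores a matrix in three arrays. The helper names dense_to_csr and csr_row are illustrative only and are not part of Jina or Scipy; in practice you would let scipy.sparse.csr_matrix build these arrays for you.

```python
from typing import Dict, List, Tuple


def dense_to_csr(rows: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Convert a dense row-major matrix into CSR arrays (data, indices, indptr)."""
    data: List[float] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it lives in
        # indptr[i + 1] marks where row i ends inside data/indices.
        indptr.append(len(data))
    return data, indices, indptr


def csr_row(data: List[float], indices: List[int], indptr: List[int], i: int) -> Dict[int, float]:
    """Return the non-zero entries of row i as {column: value}."""
    start, end = indptr[i], indptr[i + 1]
    return dict(zip(indices[start:end], data[start:end]))


dense = [
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 2.0],
]
data, indices, indptr = dense_to_csr(dense)
```

Because each row is a contiguous slice of data and indices, reading one row (one embedding) is cheap, which is a common reason to prefer CSR when every row represents a Document.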

Sparse Matrix Formats

ShortName  FullName                                Scipy  TensorFlow  Pytorch
COO        COOrdinate format                       Yes    Yes         Yes
BSR        Block Sparse Row matrix                 Yes    No          No
CSC        Compressed Sparse Column matrix         Yes    No          No
CSR        Compressed Sparse Row matrix            Yes    No          No
DIA        Sparse matrix with DIAgonal storage     Yes    No          No
DOK        Dictionary Of Keys based sparse matrix  Yes    No          No
LIL        Row-based list of lists sparse matrix   Yes    No          No

How to build a sparse pipeline in Jina

In this pipeline, we will use Jina’s TFIDFTextEncoder together with PysparnnIndexer for encoding and indexing. Begin by creating a clean folder to store your Python and YAML files. In the subsequent steps of this guide, you will create the following files.

project
├── tfidf_vectorizer.pickle
├── encode.yml
├── index.yml
├── flow_index.yml
├── flow_query.yml
├── __init__.py
├── app.py
└── fit_vectorizer.py

Step 1. Vectorize your data into sparse vector encoding

TFIDFTextEncoder is built on Scikit-learn, so before using the Encoder you need to fit the vectorizer on your training data. In this example, we use a simple corpus containing four sentences.

# fit_vectorizer.py
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
# Dump the vectorizer fitted on your training data.
# The with-statement makes sure the file is closed after writing.
with open('./tfidf_vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer, f)
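To see why the fitted vectorizer produces sparse output, here is a minimal standard-library sketch of TF-IDF weighting on the same corpus. It assumes the smoothed IDF, L2 normalisation and two-or-more-character tokens that Scikit-learn documents as TfidfVectorizer defaults; the helper names tokenize and tfidf are illustrative, and the result should be read as an approximation of what the real vectorizer computes rather than a replacement for it.

```python
import math
import re
from collections import Counter
from typing import Dict, List

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r'\b\w\w+\b', text.lower())


# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for doc in corpus for term in set(tokenize(doc)))
# Map each term to a column index, in alphabetical order.
vocabulary = {term: idx for idx, term in enumerate(sorted(doc_freq))}
n_docs = len(corpus)


def tfidf(text: str) -> Dict[int, float]:
    """Return an L2-normalised sparse vector as {column index: weight}."""
    counts = Counter(t for t in tokenize(text) if t in vocabulary)
    weights = {
        vocabulary[t]: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {idx: w / norm for idx, w in weights.items()} if norm else {}


sparse_vec = tfidf(corpus[0])
```

Only the terms that occur in a sentence receive a weight, so each vector holds a handful of entries out of the whole vocabulary. With a realistic vocabulary of tens of thousands of terms, almost every column is zero, which is what makes a sparse format worthwhile.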

Step 2. Set up the Encoder & Indexer YAML configuration

Create a YAML file with the following code snippet. This imports the tfidf_encoder from our Jina Hub and links it to the pickled vectorizer. The file should be named encode.yml.

!TFIDFTextEncoder
metas:
  name: tfidf_encoder
with:
  path_vectorizer: ./tfidf_vectorizer.pickle

For the indexer, we will use PysparnnIndexer, which performs approximate nearest neighbor search on sparse data. Since we also want to store the indexed Documents, we combine PysparnnIndexer and BinaryPbIndexer in a CompoundIndexer.

Create a second YAML file with the following code snippet. The file should be called index.yml.

!CompoundIndexer
components:
  - !PysparnnIndexer
    with:
      prefix_filename: 'pysparnn'
    metas:
      name: vecidx
  - !BinaryPbIndexer
    with:
      index_filename: doc.gz
    metas:
      name: docidx
metas:
  name: doc_compound_indexer
  workspace: $WORKDIR
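The division of labour inside the compound indexer can be pictured with a small standard-library sketch: one component maps sparse vectors to Document ids, the other maps ids back to stored content. ToyCompoundIndex is a hypothetical stand-in and not Jina's implementation, and it scans every vector exhaustively, whereas PysparnnIndexer uses an approximate search.

```python
import math
from typing import Dict, List, Tuple

SparseVec = Dict[int, float]  # {column index: weight}


class ToyCompoundIndex:
    """Hypothetical stand-in: a vector index plus a key-value document store."""

    def __init__(self) -> None:
        self.vectors: Dict[int, SparseVec] = {}  # role of the vector indexer
        self.documents: Dict[int, str] = {}      # role of the key-value indexer

    def add(self, doc_id: int, vector: SparseVec, text: str) -> None:
        self.vectors[doc_id] = vector
        self.documents[doc_id] = text

    @staticmethod
    def _cosine(a: SparseVec, b: SparseVec) -> float:
        # Only indices present in both vectors contribute to the dot product.
        dot = sum(w * b[i] for i, w in a.items() if i in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def query(self, vector: SparseVec, top_k: int = 3) -> List[Tuple[str, float]]:
        # Exhaustive scan; an approximate index avoids comparing against every vector.
        scored = sorted(
            ((self._cosine(vector, v), doc_id) for doc_id, v in self.vectors.items()),
            reverse=True,
        )
        # Resolve the winning ids to stored content, as the key-value indexer would.
        return [(self.documents[doc_id], score) for score, doc_id in scored[:top_k]]


index = ToyCompoundIndex()
index.add(0, {0: 1.0, 2: 1.0}, 'first document')
index.add(1, {1: 1.0}, 'second document')
index.add(2, {2: 1.0}, 'third document')
results = index.query({2: 1.0}, top_k=2)
```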

Step 3. Create your index flow

In this step, we create a third YAML file that contains our index flow. Copy the following code snippet and create a YAML file named flow_index.yml.

jtype: Flow
pods:
  encoder:
    uses: encode.yml
    show_exc_info: true
    timeout_ready: 600000
    read_only: true
  doc_indexer:
    uses: index.yml
    shards: 1
    separated_workspace: true

Step 4. Create your query flow

In this step, we create a fourth YAML file that contains our query flow. Copy the following code snippet and create a YAML file named flow_query.yml.

jtype: Flow
with:
  read_only: true
pods:
  encoder:
    uses: encode.yml
    timeout_ready: 600000
    read_only: true
  doc_indexer:
    uses: index.yml
    shards: 1
    separated_workspace: true
    timeout_ready: 100000

Step 5. Combine your flows and run Jina

Now you can run the whole project. Add the following code snippet to a Python file (app.py in the project layout above) and run it.

import csv
import os

from jina import Document, Flow

def index_generator():
    """
    Data from which we create `Documents`.
    """
    # JINA_DATA_PATH must hold the data file's path relative to this script.
    data_path = os.path.join(os.path.dirname(__file__), os.environ['JINA_DATA_PATH'])

    with open(data_path) as f:
        reader = csv.reader(f, delimiter='\t')
        for i, data in enumerate(reader):
            d = Document()
            d.tags['id'] = int(i)
            d.text = data[0]
            yield d

# Load index flow configuration and run the index flow.
f = Flow.load_config('flow_index.yml')
with f:
    f.index(input_fn=index_generator, request_size=16)

# Load query flow configuration and run the query flow.
f = Flow.load_config('flow_query.yml')
with f:
    f.search_lines(lines=['my query', ], top_k=3)
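The index_generator above expects a tab-separated file and uses only the first column of each line as the Document text. The following standard-library sketch writes a small file in that shape and reads it back the same way, producing plain dictionaries instead of Jina Documents so that it runs without Jina installed. The file name data.tsv and the second column are assumptions made for illustration.

```python
import csv
import os
import tempfile

rows = [
    ['This is the first document.', 'extra column'],
    ['And this is the third one.', 'ignored'],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    data_path = os.path.join(tmp_dir, 'data.tsv')  # hypothetical file name

    # Write one record per line, columns separated by tabs.
    with open(data_path, 'w', newline='') as f:
        csv.writer(f, delimiter='\t').writerows(rows)

    # Read the file back the way index_generator does:
    # the line number becomes the id, the first column becomes the text.
    with open(data_path, newline='') as f:
        records = [
            {'id': i, 'text': fields[0]}
            for i, fields in enumerate(csv.reader(f, delimiter='\t'))
        ]
```

Any columns after the first are read but never used, so extra metadata in the file does not affect indexing.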

Define your own Jina Sparse Encoder

If you want to create a customized Encoder with Jina, for example to encode your data in the Scipy COO matrix format, the code snippet below shows how you could achieve it:

from typing import Any

from scipy.sparse import coo_matrix
from jina.executors.encoders import BaseEncoder

class SimpleScipyCOOEncoder(BaseEncoder):

    def encode(self, content: 'np.ndarray', *args, **kwargs) -> Any:
        """Encode document content into `coo` format."""
        return coo_matrix(content)

Then we’re able to make use of the SimpleScipyCOOEncoder defined above, inside the Jina Index and Search Flow.
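As a rough picture of what the COO format holds after such an encoding, the sketch below converts a dense matrix into the three parallel lists (row, column, value) that define it and back again. The helpers dense_to_coo and coo_to_dense are illustrative names written with the standard library only; they are not Jina or Scipy functions.

```python
from typing import List, Tuple


def dense_to_coo(rows: List[List[float]]) -> Tuple[List[int], List[int], List[float]]:
    """Return parallel lists (row, col, data) holding only the non-zero entries."""
    row_idx: List[int] = []
    col_idx: List[int] = []
    data: List[float] = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value != 0:
                row_idx.append(r)
                col_idx.append(c)
                data.append(value)
    return row_idx, col_idx, data


def coo_to_dense(row_idx: List[int], col_idx: List[int], data: List[float],
                 shape: Tuple[int, int]) -> List[List[float]]:
    """Rebuild the dense matrix from the coordinate triplets."""
    n_rows, n_cols = shape
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, value in zip(row_idx, col_idx, data):
        dense[r][c] = value
    return dense


embeddings = [
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0],
    [0.7, 0.0, 0.1],
]
row_idx, col_idx, data = dense_to_coo(embeddings)
restored = coo_to_dense(row_idx, col_idx, data, shape=(3, 3))
```

COO is simple to construct, which suits an Encoder that builds its output once, while row-oriented formats such as CSR are usually better for repeated row access.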

Use a Jina Sparse Indexer

In Jina, we’ve created several Indexers that help you store and search Document embeddings in sparse format. You need to set the embedding_cls_type to determine which sparse type your indexer supports. For instance, PysparnnIndexer is built on PySparNN, a library for fast similarity search of sparse Scipy vectors developed by Facebook AI Research. It contains an algorithm that performs fast approximate search with sparse inputs.

Limitations

It should be noted that sparse indexers in the hub do not support ACID features.

What’s Next

If you still have questions, feel free to submit an issue or post a message in our community Slack channel.

To gain a deeper knowledge on the implementation of Jina’s primitive data types, you can find the source code here.