Build a GPU Executor#

This document shows you how to run an Executor on a GPU, both locally and in a Docker container. You will also learn how to use GPUs with pre-built Hub Executors.

Using a GPU allows you to significantly speed up encoding for most deep learning models, reducing response latency by a factor of anywhere from 5 to 100, depending on the model and inputs used.

Important

This tutorial assumes you are already familiar with basic Jina concepts, such as Document, Executor, and Flow. Some knowledge of the Hub is also needed for the last part of the tutorial.

If you’re not yet familiar with these concepts, first read the Executor and Flow documentation, and return to this tutorial once you feel comfortable performing basic operations in Jina.

Jina & GPUs in a nutshell#

If you want a thorough walk-through of how to use GPU resources in your code, the full tutorial in the next section is exactly what you are looking for.

But if you already know how to use your GPU and have come here just to find out how to make it play nice with Jina, then we have good news for you:

You just use your GPU as you normally would in your machine learning framework of choice, whether in a Python script or in a Docker container, and you are off to the races - Jina does not impose any additional requirements or configuration.

Let’s take a look at a minimal working example, written in PyTorch.

import torch

from docarray import DocumentArray
from jina import Executor, requests


class MyGPUExec(Executor):
    def __init__(self, device: str = 'cpu', *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.device = device

    @requests
    def encode(self, docs: DocumentArray, **kwargs):
        with torch.inference_mode():
            # Generate random embeddings
            embeddings = torch.rand((len(docs), 5), device=self.device)
            docs.embeddings = embeddings
            embedding_device = 'GPU' if embeddings.is_cuda else 'CPU'
            docs.texts = [f'Embeddings calculated on {embedding_device}'] * len(docs)

First, run the Flow with the Executor on the CPU:

from docarray import Document, DocumentArray
from jina import Flow

f = Flow().add(uses=MyGPUExec, uses_with={'device': 'cpu'})
docs = DocumentArray([Document()])
with f:
    docs = f.post(on='/encode', inputs=docs)
print(f'Document embedding: {docs.embeddings}')
print(docs.texts)
           Flow@80[I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:49618
	🔒 Private network:	172.28.0.2:49618
	🌐 Public address:	34.67.105.220:49618
Document embedding: tensor([[0.1769, 0.1557, 0.9266, 0.8655, 0.6291]])
['Embeddings calculated on CPU']

Now run the same Flow on the GPU by setting the device to 'cuda':

from docarray import Document, DocumentArray
from jina import Flow

f = Flow().add(uses=MyGPUExec, uses_with={'device': 'cuda'})
docs = DocumentArray([Document()])
with f:
    docs = f.post(on='/encode', inputs=docs)
print(f'Document embedding: {docs.embeddings}')
print(docs.texts)
           Flow@80[I]:🎉 Flow is ready to use!
	🔗 Protocol: 		GRPC
	🏠 Local access:	0.0.0.0:56276
	🔒 Private network:	172.28.0.2:56276
	🌐 Public address:	34.67.105.220:56276
Document embedding: tensor([[0.6888, 0.8646, 0.0422, 0.8501, 0.4016]])
['Embeddings calculated on GPU']

Just like that, your code runs on GPU, inside a Jina Flow.

Next, we will go through a more fleshed-out example in detail, where we use a language model to embed the text in our Documents - all on the GPU, and thus blazingly fast.

Prerequisites#

For this tutorial, you need to work on a machine with an NVIDIA graphics card. If you do not have such a machine at home, you can use various free cloud platforms (like Google Colab or Kaggle kernels).

You also need a recent version of the NVIDIA drivers installed. You don't need to install CUDA for this tutorial, but note that depending on the deep learning framework you use, it might be required (for local execution).

For the Docker part of the tutorial you will also need to have Docker and nvidia-docker installed.
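
A quick way to verify that the GPU is visible both on the host and from inside containers (assuming nvidia-docker / the NVIDIA Container Toolkit is set up) is the following - the CUDA image tag is just an example, pick any tag currently available on Docker Hub:

nvidia-smi
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi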

To run the Python scripts you need a virtual environment (for example venv or conda), with Jina installed inside it using

pip install jina

Setting up the Executor#

Jina Hub

In this section we create an Executor using Jina Hub. This still creates the Executor locally and privately, but makes it quick and easy to run it inside a Docker container, or to publish it to the Hub later, should you so choose.

We will create a simple sentence encoder, and we’ll start by creating the Executor “skeleton” using Jina’s command line utility:

jina hub new

When prompted for inputs, name your encoder SentenceEncoder and accept the default folder for it - this creates a SentenceEncoder/ folder inside your current directory, which will be our working directory for this tutorial.

Next, select y when prompted for advanced configuration, and leave all other answers empty, except when you are asked whether you want to create a Dockerfile - answer y to this one (we will need it in the next section). At the end, you are greeted with suggested next steps.

Next steps

Congrats! You have successfully created an Executor! Here are the next steps:

  1. Check out the generated Executor:

     cd /home/ubuntu/SentenceEncoder
     ls

  2. Understand the folder structure:

     config.yml         The YAML config file of the Executor. You can define __init__ arguments under with:, for example:

                        jtype: SentenceEncoder
                        with:
                            foo: 1
                            bar: hello
                        metas:
                            py_modules:
                                - executor.py

     Dockerfile         Describes how this Executor is built into a Docker image.
     executor.py        The main logic file of the Executor.
     manifest.yml       Metadata for the Executor (name, description, URL, keywords), for better appeal on Jina Hub.
     README.md          A usage guide of the Executor.
     requirements.txt   The Python dependencies of the Executor.

  3. Share it to Jina Hub:

     jina hub push /home/ubuntu/SentenceEncoder

Once this is done, let’s move to the newly created Executor directory:

cd SentenceEncoder

Let’s continue by specifying our requirements in the requirements.txt file

sentence-transformers==2.0.0

and installing them using

pip install -r requirements.txt

Do I need to install CUDA?

All major deep learning frameworks rely on CUDA to run on NVIDIA GPUs. However, whether you need CUDA installed on your system or not depends on the framework you are using.

In this tutorial we are using the PyTorch framework, which already includes the necessary CUDA binaries in its distribution. Other frameworks, such as TensorFlow, require you to install CUDA yourself.
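
Either way, a minimal sanity check that your framework can actually see the GPU (here with PyTorch, as used throughout this tutorial) looks like this:

import torch

# True if the NVIDIA driver and the bundled CUDA runtime are working
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # Name of the first visible GPU
    print(torch.cuda.get_device_name(0))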

Install only what you need

In this example we are installing the GPU-enabled version of PyTorch, which is the default version when installing from PyPI. However, if you know that you only need to run your Executor on CPU, you can save a lot of space (hundreds of MBs, or even GBs) by installing the CPU-only versions of your requirements. This also translates into faster start-up times when using Docker containers.

In our case, we could change the requirements.txt file to install a CPU-only version of PyTorch like this

-f https://download.pytorch.org/whl/torch_stable.html
sentence-transformers
torch==1.9.0+cpu

Now let’s fill the executor.py file with the actual code of our Executor

from docarray import Document, DocumentArray
from jina import Executor, requests
from sentence_transformers import SentenceTransformer
import torch


class SentenceEncoder(Executor):
    """A simple sentence encoder that can be run on a CPU or a GPU

    :param device: The pytorch device that the model is on, e.g. 'cpu', 'cuda', 'cuda:1'
    """

    def __init__(self, device: str = 'cpu', *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
        self.model.to(device)  # Move the model to device

    @requests
    def encode(self, docs: DocumentArray, **kwargs):
        """Add text-based embeddings to all documents"""
        with torch.inference_mode():
            embeddings = self.model.encode(docs.texts, batch_size=32)
        docs.embeddings = embeddings

Here all the device-specific magic happens in just two lines: when we create the SentenceEncoder instance we pass it the device, and then we move the PyTorch model to that device. These are exactly the same steps you would take in a standalone Python script.
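
For comparison, here is a minimal sketch of those same steps outside of Jina, as a plain Python script (the example text is just a placeholder):

import torch
from sentence_transformers import SentenceTransformer

# Load the model and place it on the desired device, exactly as in the Executor
model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')

with torch.inference_mode():
    embeddings = model.encode(['Using a GPU speeds up encoding.'], batch_size=32)
print(embeddings.shape)  # (1, 384) for this model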

To see how we would pass the device we want the Executor to use, let’s create another file - main.py - which demonstrates the usage of this encoder by encoding 10,000 text Documents.

from docarray import Document
from jina import Flow

from executor import SentenceEncoder


def generate_docs():
    for _ in range(10_000):
        yield Document(
            text='Using a GPU allows you to significantly speed up encoding.'
        )


f = Flow().add(uses=SentenceEncoder, uses_with={'device': 'cpu'})
with f:
    f.post(on='/encode', inputs=generate_docs, show_progress=True, request_size=32)

Running on GPU and CPU locally#

Let’s try it out by running the same code on the CPU and the GPU, so we can observe the speedup.

To toggle between the two, simply set your device type to 'cuda', and your GPU will take over the work:

- f = Flow().add(uses=SentenceEncoder, uses_with={'device': 'cpu'})
+ f = Flow().add(uses=SentenceEncoder, uses_with={'device': 'cuda'})
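
If you prefer not to edit the script by hand, a small optional tweak to main.py (our own addition, not part of the generated Executor) picks the device automatically and falls back to CPU when no GPU is visible:

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
f = Flow().add(uses=SentenceEncoder, uses_with={'device': device})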

Then, run the script:

python main.py

And compare the results. First, on the CPU:

      executor0@26554[L]:ready and listening
        gateway@26554[L]:ready and listening
           Flow@26554[I]:🎉 Flow is ready to use!
        🔗 Protocol:            GRPC
        🏠 Local access:        0.0.0.0:56969
        🔒 Private network:     172.31.39.70:56969
        🌐 Public address:      52.59.231.246:56969
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━ 0:00:20 15.1 step/s 314 steps done in 20 seconds

And on the GPU:

      executor0@21032[L]:ready and listening
        gateway@21032[L]:ready and listening
           Flow@21032[I]:🎉 Flow is ready to use!
        🔗 Protocol:            GRPC
        🏠 Local access:        0.0.0.0:54255
        🔒 Private network:     172.31.39.70:54255
        🌐 Public address:      52.59.231.246:54255
Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸━━━━━━ 0:00:03 90.9 step/s 314 steps done in 3 seconds

Running this code on a g4dn.xlarge AWS instance with a single NVIDIA T4 GPU attached, we can see that the embedding time drops from 20s to 3s by running on the GPU - more than a 6x speedup! And that's not even the best we can do: if we increased the batch size to max out the GPU's memory, we would get an even larger speedup. But such optimizations are beyond the scope of this tutorial.
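
As a starting point for such experiments, one option is to expose the batch size as an __init__ argument - a sketch of this idea (the batch_size parameter is our own addition, not part of the Executor built above):

import torch
from docarray import DocumentArray
from jina import Executor, requests
from sentence_transformers import SentenceTransformer


class SentenceEncoder(Executor):
    def __init__(self, device: str = 'cpu', batch_size: int = 32, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.batch_size = batch_size
        self.model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

    @requests
    def encode(self, docs: DocumentArray, **kwargs):
        with torch.inference_mode():
            docs.embeddings = self.model.encode(docs.texts, batch_size=self.batch_size)


# e.g. Flow().add(uses=SentenceEncoder, uses_with={'device': 'cuda', 'batch_size': 256})
# Remember to also raise request_size in f.post, otherwise each request
# still only carries 32 Documents.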

Note

You have probably noticed that there was a delay (about 3 seconds) when creating the Flow. This occurred because the weights of our model needed to be transferred from CPU to GPU when the Executor was initialized. However, this only happens once in the lifetime of the Executor, so for most use cases it is nothing to worry about.

Using GPU in a container#

Using your GPU inside a container

For this part of the tutorial, you need nvidia-container-toolkit installed on your machine. If you haven't installed it already, follow NVIDIA's official installation guide.

When you use your Executor in production, you will most likely want to put it in a Docker container, to provide proper environment isolation and to be able to use it easily on any machine.

Using GPU-enabled Executors in a container is no harder than using them locally - we don't even need to modify the default Dockerfile.

Choosing the right base image

In our case we are using the default jinaai/jina:latest base image. However, just as with installing CUDA locally, you might need a different base image depending on the framework you are using.

If you need CUDA installed in the image, you usually have two options: either take nvidia/cuda as the base image, or take the official GPU-enabled image of the framework you are using, for example tensorflow/tensorflow:2.6.0-gpu.
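
As an illustration, such a Dockerfile might look like the sketch below. It is modeled on the default Dockerfile that jina hub new generates, with only the base image swapped - the exact image and tag are up to you:

FROM tensorflow/tensorflow:2.6.0-gpu

# The framework image does not ship with Jina, so install it alongside the Executor's dependencies
COPY . /executor_root/
WORKDIR /executor_root
RUN pip install jina
RUN pip install -r requirements.txt

ENTRYPOINT ["jina", "executor", "--uses", "config.yml"]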

The other file we care about in this case is config.yml, and here the default version works as well. So let’s build the Docker image

docker build -t sentence-encoder .

You can run the container to quickly check that everything is working well

docker run sentence-encoder

Now, let’s use the Docker version of our encoder with the GPU. If you’ve dealt with GPUs in containers before, you probably remember that to use a GPU inside a container you need to pass the --gpus all option to the docker run command. Jina enables you to do just that.
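
For reference, granting the container GPU access with plain Docker looks like this (it only starts the Executor - it is the Flow below that actually sends requests to it):

docker run --gpus all sentence-encoder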

Here’s how we need to modify our main.py script to use the GPU-enabled containerized Executor

from docarray import Document
from jina import Flow


def generate_docs():
    for _ in range(10_000):
        yield Document(
            text='Using a GPU allows you to significantly speed up encoding.'
        )

f = Flow().add(
    uses='docker://sentence-encoder', uses_with={'device': 'cuda'}, gpus='all'
)
with f:
    f.post(on='/encode', inputs=generate_docs, show_progress=True, request_size=32)

If we run this with python main.py, we’ll get the same output as before, except that now we also see the output from the Docker container.

You may notice that every time we start the Executor, the transformer model gets downloaded again. To speed this up, we want the encoder to load the model from files that we have already downloaded to disk.

We can do this with Docker volumes - Jina simply passes the argument on to the Docker container. Here’s how we modify main.py to allow that

f = Flow().add(
    uses='docker://sentence-encoder',
    uses_with={'device': 'cuda'},
    gpus='all',
    # This has to be an absolute path, replace /home/ubuntu with your home directory
    volumes="/home/ubuntu/.cache:/root/.cache",
)

Here we mounted the ~/.cache directory, because this is where the pre-trained transformer models are saved in our case. But this could be any custom directory - it depends on the Python package you are using and on how you specify the model loading path.

Now, if we run python main.py again, you can see that no downloading happens inside the container, and that encoding starts sooner.

Using GPU with Hub Executors#

We have now seen how to use a GPU with our Executor locally and inside a Docker container. What about Executors from Jina Hub - is there any difference?

Nope! Better yet, many Executors on Jina Hub already come with a pre-built GPU-enabled version, usually under the gpu tag (see Jina Hub tags). Let’s modify our example to use the pre-built TransformerTorchEncoder from Jina Hub

f = Flow().add(
-   uses='docker://sentence-encoder',
+   uses='jinahub+docker://TransformerTorchEncoder/latest-gpu',
    uses_with={'device': 'cuda'},
    gpus='all',
    # This has to be an absolute path, replace /home/ubuntu with your home directory
    volumes="/home/ubuntu/.cache:/root/.cache"
)

You’ll see that the first time you run the script, downloading the Docker image will take some time - GPU images are large! But after that, everything will work just as it did with your local Docker image, out of the box.

Important

When using GPU encoders from Jina Hub, always use jinahub+docker://, and not jinahub://. As discussed above, these encoders might need CUDA installed (or other system dependencies), and installing that properly can be tricky. For that reason, you should prefer using Docker images, which already come with all these dependencies pre-installed.

Conclusion#

Let’s recap what we saw in this tutorial:

  1. Using Executors on a GPU locally is no different from using a GPU in a standalone script: you pass the device you want the Executor to use at initialization.

  2. To use an Executor on a GPU inside a Docker container, make sure to pass gpus='all'.

  3. Use volumes (bind mounts) so you don’t have to download large files each time you start the Executor.

  4. You can use GPUs with Executors from Jina Hub - just make sure to use the Executor version with the gpu tag.

And when you start building your own Executor, always check what system requirements (CUDA and similar) are needed, and install them locally (and in the Dockerfile) accordingly.