Fuzzy String Matching in 30 Lines

Different behavior on Jupyter Notebook

Be aware of the following when running this tutorial in jupyter notebook. Some python built-in attributes such as __file__ do not exist. You can change __file__ for any other file path existing in your system.

Now that you understand all fundamental concepts, let’s practice the learnings and build a simple end-to-end demo.

We will use Jina to implement a fuzzy search solution on source code: given a snippet source code and a query, find all lines that are similar to the query. It is like grep but in fuzzy mode.

Client-Server architecture



Character embedding

Let’s first build a simple Executor for character embedding:

import numpy as np
from jina import DocumentArray, Executor, requests

class CharEmbed(Executor):  # a simple character embedding with mean-pooling
    offset = 32  # letter `a`
    dim = 127 - offset + 1  # last pos reserved for `UNK`
    char_embd = np.eye(dim) * 1  # one-hot embedding for all chars

    def foo(self, docs: DocumentArray, **kwargs):
        for d in docs:
            r_emb = [ord(c) - self.offset if self.offset <= ord(c) <= 127 else (self.dim - 1) for c in d.text]
            d.embedding = self.char_embd[r_emb, :].mean(axis=0)  # average pooling

Indexer with Euclidean distance

from jina import DocumentArray, Executor, requests

class Indexer(Executor):
    _docs = DocumentArray()  # for storing all documents in memory

    def foo(self, docs: DocumentArray, **kwargs):
        self._docs.extend(docs)  # extend stored `docs`

    def bar(self, docs: DocumentArray, **kwargs):
        docs.match(self._docs, metric='euclidean', limit=20)

Put it together in a Flow

from jina import Flow

f = (Flow(port_expose=12345, protocol='http', cors=True)
        .add(uses=CharEmbed, replicas=2)
        .add(uses=Indexer))  # build a Flow, with 2 shard CharEmbed, tho unnecessary

Start the Flow and index data

from jina import Document

with f:
    f.post('/index', (Document(text=t.strip()) for t in open(__file__) if t.strip()))  # index all lines of _this_ file
    f.block()  # block for listening request


open(__file__) means open the current file and use it for indexing. Note in some enviroment such as Jupyter Notebook and Google Colab, __file__ is not defined. In this case, you may want to replace it to open('my-source-code.py').

Query via SwaggerUI

Open http://localhost:12345/docs (an extended Swagger UI) in your browser, click /search tab and input:

  "data": [
      "text": "@requests(on=something)"

That means, **we want to find lines from the above code snippet that are most similar to @request(on=something).**Now click Execute button!


Query from Python

Let’s do it in Python then! Keep the above server running and start a simple client:

from jina import Client, Document
from jina.types.request import Response

def print_matches(resp: Response):  # the callback function invoked when task is done
    for idx, d in enumerate(resp.docs[0].matches[:3]):  # print top-3 matches
        print(f'[{idx}]{d.scores["euclidean"].value:2f}: "{d.text}"')

c = Client(protocol='http', port=12345)  # connect to localhost:12345
c.post('/search', Document(text='request(on=something)'), on_done=print_matches)

, which prints the following results:

         [email protected][S]:connected to the gateway at localhost:12345!
[0]0.168526: "@requests(on='/index')"
[1]0.181676: "@requests(on='/search')"
[2]0.218218: "from jina import Document, DocumentArray, Executor, Flow, requests"