# Open-Domain Question-Answering on Long Documents

This tutorial walks you through one solution for question-answering on a long document. This is an inherently difficult task because human language is fuzzy and the number of questions one could ask is unlimited.

One way to solve this is by predicting answers using a neural network that was trained on pairs of questions and their corresponding answers. In many cases such a dataset is not available, like in the case of most software documentation. Let’s say we want to build a chatbot to answer questions about the Jina documentation. What if I told you that there is a way to reframe this task as a search problem and that this would alleviate the need for a large dataset of matching questions and answers?

How, you ask? Let me explain!

## Overview

Our approach to the problem leverages the Doc2query method, which, from a piece of text, predicts different questions the text could potentially answer. For example, given a sentence such as "Jina is an open source framework for neural search.", the model predicts questions such as "What is Jina?" or "Is Jina open source?".

The idea here is to predict several questions for every part of the original text document, in our case the Jina documentation. Then we use an encoder to create a vector representation for each of the predicted questions. These representations are stored and provide the index for our body of text. When a user prompts the bot with a question, we encode it in the same way we encoded our generated questions. Now we can run a similarity search on the encodings: the encoding of the user’s query is compared with the encodings in our index to find the closest match.

Since we know which part of the original text was used to generate the question that was most similar to the user’s query, we can return that original text as the answer to the user.
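
The whole idea fits in a few lines of plain Python. The sketch below is only an illustration: the generated questions are hard-coded rather than predicted by Doc2query, and a bag-of-words counter stands in for the neural encoder. The names encode, cosine and answer are made up for this example.

```python
import math
import re
from collections import Counter

# Each passage comes with hand-written stand-ins for generated questions.
passages = {
    "Jina is an open source framework for neural search.": [
        "What is Jina?",
        "Is Jina open source?",
    ],
    "Executor processes a DocumentArray in-place": [
        "What does an Executor process?",
    ],
}


def encode(text):
    """Toy encoder: lowercase bag-of-words counts (not a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# The index stores one (embedding, source passage) pair per question.
index = [(encode(q), passage) for passage, qs in passages.items() for q in qs]


def answer(query):
    """Return the passage whose generated question is closest to the query."""
    query_vec = encode(query)
    _, best_passage = max(index, key=lambda entry: cosine(query_vec, entry[0]))
    return best_passage


print(answer("Is Jina open source software?"))
```

Note that the user's query is never compared with the passages themselves, only with the questions attached to them.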

Now that you have a general idea of what we will be doing, the following section will show you how to define our Flows in Jina. Then we will take a look at how to implement the necessary Executors for our search-based question-answering system.

## Indexing the text document

Let’s imagine we extracted a bunch of sentences from Jina’s documentation and stored them in a DocumentArray, as shown below.

    from jina import Document, DocumentArray

    example_sentences = [
        "Document is the basic data type that Jina operates with",
        "Executor processes a DocumentArray in-place",
        # ... more sentences ...
        "Jina uses the concept of a flow to tie different executors together",
    ]

    docs = DocumentArray([Document(content=sentence) for sentence in example_sentences])
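
How the sentences are extracted is outside the scope of this tutorial. As one possible approach, a naive regular-expression splitter can cut a longer passage into sentence-sized pieces. The function split_sentences and the sample text are hypothetical, and real documentation (with code snippets, abbreviations and so on) usually needs a more careful splitter.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = (
    "Document is the basic data type that Jina operates with. "
    "Executor processes a DocumentArray in-place."
)
for sentence in split_sentences(sample):
    print(sentence)
```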


As described in the last section, we first need to predict potential questions for each of the elements in the DocumentArray. Then we have to use another model to create vector encodings from the predicted questions. Finally, we store them as the index.

At this point we have enough information to start defining our Flows.

    from jina import Flow

    indexing_flow = (
        Flow()
        # Generate potential questions using doc2query
        .add(uses=QuestionGenerator, uses_with={"random_seed": 12345})
        # Create vector representations for generated questions
        .add(uses=TextEncoder, uses_with={"parameters": {"traversal_paths": ["c"]}})
        # Store embeddings for all generated questions as index
        .add(uses=SimpleIndexer)
    )

    with indexing_flow:
        # Run the indexing on all extracted sentences
        indexing_flow.post(on="/index", inputs=docs, on_done=print)


## Searching the user’s query against the index

After defining the Flow for indexing our document, we are ready to work on answering user queries. Incoming queries also need to be encoded, and for that we use the same encoder that we used for our generated questions. Then the SimpleIndexer performs a similarity search to retrieve the closest generated questions and, through them, the text that answers the query.

The flow for searching is much simpler than the one for indexing and looks like this:

    query_flow = (
        Flow()
        # Create vector representations from query
        .add(uses=TextEncoder)
        # Search the index for matching generated questions
        .add(uses=SimpleIndexer)
    )

    with query_flow:
        query_flow.post(on="/search", inputs=user_queries, on_done=print)
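
To see why the endpoint name matters, it can help to picture a Flow as a chain of Executors in which each Executor reacts only to the endpoints it registered for. The sketch below is a plain-Python model of that idea, not Jina's actual implementation; ToyFlow, ToyEncoder and ToyIndexer are invented names, and the string length is a placeholder for a real embedding.

```python
class ToyEncoder:
    """Handles both endpoints, like the encoder in this tutorial."""
    endpoints = {"/index", "/search"}

    def handle(self, endpoint, docs):
        for doc in docs:
            doc["embedding"] = len(doc["text"])  # placeholder embedding


class ToyIndexer:
    """Stores documents on /index and looks them up on /search."""
    endpoints = {"/index", "/search"}

    def __init__(self):
        self.stored = []

    def handle(self, endpoint, docs):
        if endpoint == "/index":
            self.stored.extend(docs)
        else:
            for doc in docs:
                # Nearest stored document by placeholder embedding
                doc["match"] = min(
                    self.stored,
                    key=lambda s: abs(s["embedding"] - doc["embedding"]),
                )["text"]


class ToyFlow:
    """Passes documents through every Executor that serves the endpoint."""

    def __init__(self, *executors):
        self.executors = executors

    def post(self, on, inputs):
        for executor in self.executors:
            if on in executor.endpoints:
                executor.handle(on, inputs)
        return inputs


indexer = ToyIndexer()
flow = ToyFlow(ToyEncoder(), indexer)
flow.post(on="/index", inputs=[{"text": "short"}, {"text": "a longer sentence"}])
result = flow.post(on="/search", inputs=[{"text": "tiny"}])
print(result[0]["match"])

# In this toy model, an endpoint nobody serves leaves the input untouched.
unknown = flow.post(on="/query", inputs=[{"text": "tiny"}])
print("match" in unknown[0])
```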


Now that we have seen the overall structure of the approach and have defined our Flows, we can code up the Executors.

## Building the Executor to Generate Potential Questions

The first Executor that we implement is the QuestionGenerator. It is a wrapper around the model that predicts potential questions that a given piece of text can answer.

Apart from that, it just loops over all provided parts of input text. After potential questions are predicted for each of the inputs, they are stored as chunks alongside the original text.

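    from typing import Optional

    import torch
    from jina import Document, DocumentArray, Executor, requests
    from transformers import T5ForConditionalGeneration, T5Tokenizer


    class QuestionGenerator(Executor):

        def __init__(self, random_seed: Optional[int] = None, **kwargs):
            super().__init__(**kwargs)
            # Assumption: random_seed (passed via uses_with in the indexing
            # Flow) is meant to make the sampling below reproducible
            if random_seed is not None:
                torch.manual_seed(random_seed)
            # Load tokenizer and model once instead of on every request
            self._tokenizer = T5Tokenizer.from_pretrained(
                'castorini/doc2query-t5-base-msmarco')
            self._model = T5ForConditionalGeneration.from_pretrained(
                'castorini/doc2query-t5-base-msmarco')

        @requests
        def doc2query(self, docs: DocumentArray, **kwargs):
            """Generates potential questions for each answer"""
            for d in docs:
                input_ids = self._tokenizer.encode(
                    d.content, return_tensors='pt')
                # Generate potential queries for each piece of text
                outputs = self._model.generate(
                    input_ids=input_ids,
                    max_length=64,
                    do_sample=True,
                    num_return_sequences=10,
                )
                # Decode the outputs to text and store them as chunks
                for output in outputs:
                    question = self._tokenizer.decode(
                        output, skip_special_tokens=True).strip()
                    d.chunks.append(Document(text=question))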


We try to give credit where credit is due and want to mention here the paper that introduced the doc2query approach.

## Building the Encoder

The next step is to build the Executor that we will use to create vector representations from human-readable text.

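    from typing import Optional

    import torch
    from jina import DocumentArray, Executor, requests
    from sentence_transformers import SentenceTransformer


    class TextEncoder(Executor):

        def __init__(self, parameters: Optional[dict] = None, **kwargs):
            super().__init__(**kwargs)
            # Encode root documents by default; the indexing Flow passes
            # {'traversal_paths': ['c']} via uses_with to encode the chunks
            self.parameters = parameters or {'traversal_paths': ['r']}
            self.model = SentenceTransformer(
                'paraphrase-mpnet-base-v2', device="cpu", cache_folder=".")

        @requests(on=['/search', '/index'])
        def encode(self, docs: DocumentArray, **kwargs):
            """Wraps encoder from sentence-transformers package"""
            target = docs.traverse_flat(self.parameters['traversal_paths'])

            with torch.inference_mode():
                embeddings = self.model.encode(target.texts)
                target.embeddings = embeddings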


Similar to the QuestionGenerator, the TextEncoder is simply a wrapper around the SentenceTransformer from the sentence-transformers package. When provided with a DocumentArray containing text, it encodes the text of each element and stores the result in that element's embedding attribute.
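
The traversal_paths setting decides which level of the Document tree gets encoded: 'r' selects the root Documents and 'c' selects their chunks, which is where the generated questions live. The sketch below imitates that selection with plain dictionaries; traverse_flat_toy is a hypothetical helper that only handles these two paths and is not Jina's implementation.

```python
def traverse_flat_toy(docs, traversal_paths):
    """Collect documents at the requested levels into one flat list."""
    selected = []
    for path in traversal_paths:
        if path == "r":      # root level: the original sentences
            selected.extend(docs)
        elif path == "c":    # chunk level: the generated questions
            for doc in docs:
                selected.extend(doc["chunks"])
        else:
            raise ValueError(f"unsupported path in this sketch: {path!r}")
    return selected


docs = [
    {
        "text": "Jina is an open source framework for neural search.",
        "chunks": [{"text": "What is Jina?"}, {"text": "Is Jina open source?"}],
    }
]

print([d["text"] for d in traverse_flat_toy(docs, ["r"])])
print([d["text"] for d in traverse_flat_toy(docs, ["c"])])
```

During indexing the questions are the things that need embeddings, while at query time the user's question is itself a root Document.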

Now let’s move on to the last part and create the indexer.

## Putting it Together with the Indexer

The indexer is the only one of our Executors that handles more than one task, namely indexing and search.

When it is used to perform indexing, index() is called. This stores all provided documents, together with their embeddings, as a DocumentArrayMemmap.

However, when the SimpleIndexer is used to handle an incoming query, the search() function is called. It performs a similarity search and ranks the results.

    from collections import defaultdict

    from jina import Document, DocumentArray, DocumentArrayMemmap, Executor, requests


    class SimpleIndexer(Executor):
        """Simple indexer class"""

        def __init__(self, **kwargs):
            super().__init__(**kwargs)
            self._docs = DocumentArrayMemmap(".")

        @requests(on='/index')
        def index(self, docs: 'DocumentArray', **kwargs):
            # Stores the index in attribute
            if docs:
                self._docs.extend(docs)

        @requests(on='/search')
        def search(self, docs: 'DocumentArray', **kwargs):
            """Append best matches to each document in docs"""

            # Match query against the index using cosine similarity
            docs.match(
                DocumentArray(self._docs),
                metric='cosine',
                normalization=(1, 0),
                limit=100,
                traversal_rdarray=['c'],
            )

            for d in docs:
                match_similarity = defaultdict(float)
                # For each match
                for m in d.matches:
                    # Get cosine similarity
                    match_similarity[m.parent_id] = m.scores['cosine'].value

                sorted_similarities = sorted(
                    match_similarity.items(), key=lambda v: v[1], reverse=True)

                # Rank the matches by similarity
                for k, v in sorted_similarities:
                    m = Document(self._docs[k], copy=True)
                    d.matches.append(m)
                    if len(d.matches) >= 10:
                        break
                d.pop('embedding')
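
The second half of search() works with parent ids: several generated questions can point to the same original sentence, so scores are collected per parent before ranking. The sketch below shows one way to do that grouping with plain tuples. It keeps the best score per parent, which is an assumption made for this example rather than a description of what the Executor above does, and the ids and scores are made up.

```python
from collections import defaultdict

# (parent_id, similarity) for each matched question; values are made up
# and assumed to lie in [0, 1], with higher meaning more similar.
chunk_matches = [
    ("sentence-1", 0.91),
    ("sentence-2", 0.85),
    ("sentence-1", 0.40),
    ("sentence-3", 0.10),
]


def rank_parents(matches, limit=10):
    """Rank parent ids by the best score among their matched questions."""
    best = defaultdict(float)
    for parent_id, score in matches:
        best[parent_id] = max(best[parent_id], score)
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


for parent_id, score in rank_parents(chunk_matches, limit=2):
    print(parent_id, score)
```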


The ranking of the results is reflected in the order of the matches inside the matches object. Hence, to provide the answer to the user, we could use a little helper function that gets the id of the best-fitting match and searches the index for the sentence with this id.

    best_matching_id = user_queries[0].matches[0].id


    # The function name get_answer is illustrative
    def get_answer(docs, best_matching_id):
        """Get the answer for most similar question"""
        ret = None
        for doc in docs:
            # Search all questions for each sentence
            for c in doc.chunks:
                # Get the question that fits best
                if c.id == best_matching_id:
                    # Return the answer to best fitting question
                    ret = doc.text
        return ret


    # Prints the answer text to our question
    print(get_answer(docs, best_matching_id))
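
Scanning every chunk for every query is fine for a handful of sentences. For a larger index, one alternative is to build a lookup table from chunk id to parent text once and reuse it. The sketch below uses plain dictionaries in place of Jina Documents, and build_answer_lookup is a hypothetical helper.

```python
def build_answer_lookup(docs):
    """Map the id of every generated question to its source sentence."""
    return {
        chunk["id"]: doc["text"]
        for doc in docs
        for chunk in doc["chunks"]
    }


docs = [
    {
        "text": "Document is the basic data type that Jina operates with",
        "chunks": [{"id": "q1", "text": "What is a Document?"}],
    },
    {
        "text": "Executor processes a DocumentArray in-place",
        "chunks": [{"id": "q2", "text": "What does an Executor process?"}],
    },
]

lookup = build_answer_lookup(docs)
# .get returns None for an unknown id, like the loop-based helper
print(lookup.get("q2"))
print(lookup.get("missing"))
```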