Question-Answering on in-Video Content¶
Asking questions is a natural way to perform a search. When you want to know the definition of Document in Jina, you will naturally ask, “What is the Document in Jina?”. The expected answer can be found either in Jina’s docs or in the introduction videos on Jina’s YouTube channel. Thanks to the latest advances in NLP, AI models can automatically find these answers in the content.
The goal of this tutorial is to build a Question-Answering (QA) system for video content. Although most existing QA models only work on text, most videos contain speech, which carries rich information about the video and can be converted to text via speech-to-text (STT). Videos with speech are therefore a natural fit for text-based question answering.
In this tutorial, we will show you how to find and extract content from videos that answers a query question. Instead of just finding related videos and having the user skim through the whole video, QA models can tell the user which second they should start from to get the answer to their question.
Build the Flow¶
To convert speech information from the videos into text, we can rely on STT algorithms. Fortunately, for most videos on YouTube, you can download the subtitles that are generated automatically via STT. In this example, we assume the video files already have subtitles embedded. By loading these subtitles, we can get the text of the speech together with the beginning and ending timestamps.
You can use youtube-dl to download YouTube videos with embedded subtitles:
youtube-dl --write-auto-sub --embed-subs --recode-video mkv -o zvXkQkqd2I8 https://www.youtube.com/watch\?v\=zvXkQkqd2I8
Subtitles generated with STT are not 100% accurate. Usually, you need to post-process the subtitles. For example, in the toy data, we use an introduction video of Jina. In the auto-generated subtitles, Jina is misspelled as gina, etc. Worse still, most of the sentences are broken and there is no punctuation.
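A minimal sketch of this kind of post-processing, assuming a hand-written table of known mis-transcriptions. The `CORRECTIONS` mapping and the `clean_subtitle` helper are hypothetical names used for illustration only; they are not part of any Executor in this tutorial.

```python
import re

# Hypothetical table of known STT mistakes and their intended spelling.
CORRECTIONS = {"gina": "Jina"}


def clean_subtitle(text: str) -> str:
    """Collapse whitespace and fix known mis-transcribed words."""
    text = " ".join(text.split())
    for wrong, right in CORRECTIONS.items():
        # \b restricts the match to whole words, so "original" is left untouched.
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    return text


print(clean_subtitle("welcome to  gina the neural search framework"))
```

Restoring punctuation and merging broken sentences usually needs more than a lookup table, so this only covers the simplest part of the clean-up.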
With the subtitles of the videos, we further need a QA model. The input to the QA model usually has two parts: the question and the context. The context denotes the candidate texts that contain the answers. In our case, the context corresponds to the subtitles from which the answers are extracted.
To save computational cost, we want the context to be as short as possible. To generate such contexts, one can use either sparse vectors from traditional information retrieval or dense vectors. In this example, we use the dense vectors that are shipped together with the QA model.
With traditional methods, retrieval can also be done using BM25, TF-IDF, etc.
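To make the sparse alternative concrete, here is a small TF-IDF ranking sketch in plain Python. It is not what this example uses (the example relies on dense vectors), and the whitespace tokenizer and smoothed IDF formula are simplifying assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive whitespace tokenizer; real systems also strip punctuation and stem.
    return text.lower().split()


def tfidf_rank(query: str, passages: list[str]) -> list[tuple[float, str]]:
    """Score each passage by the summed TF-IDF weight of the query terms."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    query_terms = set(tokenize(query))
    scored = []
    for passage, counts in zip(passages, docs):
        total = sum(counts.values())
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in docs if term in d)  # document frequency
            if df == 0 or total == 0:
                continue
            tf = counts[term] / total
            idf = math.log((1 + n) / (1 + df)) + 1.0  # smoothed IDF
            score += tf * idf
        scored.append((score, passage))
    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


subtitles = [
    "a document is the basic data type",
    "flows connect executors",
    "executors process documents",
]
best_score, best_passage = tfidf_rank("what is a document", subtitles)[0]
print(best_passage)
```

Because the match is on exact tokens, "documents" does not match "document" here, which is one reason dense vectors can retrieve better candidates for short, noisy subtitles.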
We use VideoLoader to extract subtitles from the videos. It uses ffmpeg to extract the subtitles and then generates chunks based on the subtitles using webvtt-py. The subtitles are stored in the chunks, together with other meta-information in the tags, including timestamp and video information. Extracted subtitles have the following attributes:
||Text of the subtitle|
||Index of the subtitle in the video, starting from
||always set to
||Beginning of the subtitle in seconds|
||End of the subtitle in seconds|
||URI of the video|
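As a rough illustration of how subtitle cues turn into records with text, index, begin, end and video URI, the sketch below parses WebVTT-style cues with the standard library only. It is not how VideoLoader or webvtt-py is implemented. The dictionary keys (`begin_s`, `end_s`, `index`, `video_uri`) are hypothetical names, the index is zero-based by assumption, and timestamps are assumed to always include the hour field.

```python
import re

# Matches "HH:MM:SS.mmm --> HH:MM:SS.mmm" at the start of a cue timing line.
CUE_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d+):(\d{2}):(\d{2})\.(\d{3})"
)


def to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_cues(vtt_text: str, video_uri: str) -> list[dict]:
    """Turn WebVTT-style cues into one record per subtitle."""
    cues = []
    lines = vtt_text.splitlines()
    i = 0
    while i < len(lines):
        match = CUE_TIMING.match(lines[i].strip())
        i += 1
        if not match:
            continue
        # The cue text runs until the next blank line.
        text_lines = []
        while i < len(lines) and lines[i].strip():
            text_lines.append(lines[i].strip())
            i += 1
        groups = match.groups()
        cues.append({
            "text": " ".join(text_lines),
            "index": len(cues),  # assumed zero-based
            "begin_s": to_seconds(*groups[:4]),
            "end_s": to_seconds(*groups[4:]),
            "video_uri": video_uri,
        })
    return cues


sample = """WEBVTT

00:00:01.000 --> 00:00:03.500
hello and welcome

00:01:02.250 --> 00:01:05.000
what is a document
in gina
"""
for cue in parse_cues(sample, "zvXkQkqd2I8.mkv"):
    print(cue)
```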
An embedding model to encode the questions and answers into vectors in the same space. This way, we can retrieve candidate sentences that are most likely to contain the answer.
A reader model that extracts exact answers from candidate sentences.
DPR is a set of tools and models for open domain Q&A tasks.
For the indexer, we choose SimpleIndexer for demonstration purposes. It stores both vectors and meta-information together. You can find more information on Jina Hub.
Go through the Flow¶
Because the indexing and querying Flows have only one shared Executor, we create separate Flows for each task.
The index request contains Documents that have the path information of the video files stored in their
There are three Executors in the index Flow:
VideoLoader extracts the subtitles and stores them in chunks. Each chunk of the Document has one subtitle stored in its text attribute.
DPRTextEncoder encodes the subtitles into vectors.
SimpleIndexer stores the vectors and other meta-information.
There are four Executors in the query Flow:
DPRTextEncoder takes the question stored in the text attribute of the query Document and encodes it into a vector.
SimpleIndexer retrieves related subtitles by finding the nearest neighbours in the vector space. The retrieved results are stored in the matches attribute of the query Document. Each Document in the matches also carries all the meta-information about the subtitles, which SimpleIndexer retrieves together with the subtitle text.
DPRReaderRanker finds exact answers by using the question and the candidate subtitles. The question and candidate subtitles are stored in the text attributes of the Document and its matches respectively. Replacing the existing matches, DPRReaderRanker stores the best-matched answers in the text attribute of the new matches. Other meta-information is also copied into the new matches. DPRReaderRanker returns two types of scores: scores['relevance_score'] measures the relevance to the question of the subtitle from which the answer is extracted, and scores['span_score'] indicates the weight of the extracted answer among the subtitles.
Text2Frame gets the video frame information from the retrieved answers and prepares the Document matches for displaying in the frontend.
The overall structure of the query Flow is as follows:
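The retrieve, read and locate steps described above can be pictured in plain Python. This is a toy sketch under loud assumptions, not the Executors' actual code: the query vector is supplied by hand instead of coming from an encoder, the "reader" is a simple word-overlap heuristic rather than a DPR model, and every name (`Subtitle`, `retrieve`, `extract_answer`, `answer_with_timestamp`, `begin_s`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    text: str
    begin_s: float          # start of the subtitle in seconds (hypothetical field)
    video_uri: str
    embedding: list[float]  # vector produced at index time


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: list[float], index: list[Subtitle], top_k: int = 2) -> list[Subtitle]:
    """Stand-in for the indexer step: nearest neighbours by dot product."""
    ranked = sorted(index, key=lambda s: dot(query_vec, s.embedding), reverse=True)
    return ranked[:top_k]


def extract_answer(question: str, candidates: list[Subtitle]) -> Subtitle:
    """Stand-in for the reader step: pick the candidate sharing most words with the question."""
    q_words = set(question.lower().split())
    return max(candidates, key=lambda s: len(q_words & set(s.text.lower().split())))


def answer_with_timestamp(question: str, query_vec: list[float], index: list[Subtitle]) -> dict:
    """Chain retrieval and reading, then report where in the video to start."""
    best = extract_answer(question, retrieve(query_vec, index))
    return {"answer": best.text, "start_at_s": best.begin_s, "video_uri": best.video_uri}


index = [
    Subtitle("a document is the basic data type", 12.0, "intro.mkv", [0.9, 0.1]),
    Subtitle("flows connect executors", 40.0, "intro.mkv", [0.1, 0.9]),
    Subtitle("thanks for watching", 95.0, "intro.mkv", [0.5, 0.5]),
]
print(answer_with_timestamp("what is a document", [1.0, 0.0], index))
```

The point of the sketch is the data flow: retrieval narrows the subtitles to a few candidates, the reader picks an answer among them, and the begin timestamp carried along as meta-information tells the user where to start watching.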
Use DPRTextEncoder differently in two Flows¶
You might note that DPRTextEncoder is used in both the index and query Flows:
In the index Flow it encodes subtitle text
In the query Flow it encodes query questions
In these two cases, we need to choose different models to encode the different attributes of the Documents. To achieve this, we use different initialization settings for DPRTextEncoder by overriding the with arguments in the YAML file. To do this, we need to pass the new arguments to uses_with. You can find more information in Jina’s docs.
# index.yml
...
  - name: encoder
    uses: jinahub://DPRTextEncoder/v0.2
    uses_with:
      pretrained_model_name_or_path: 'facebook/dpr-ctx_encoder-single-nq-base'
      encoder_type: 'context'
      traversal_paths:
        - 'c'
...

# query.yml
...
  - name: encoder
    uses: jinahub://DPRTextEncoder/v0.2
    uses_with:
      pretrained_model_name_or_path: 'facebook/dpr-question_encoder-single-nq-base'
      encoder_type: 'question'
      batch_size: 1
...
Get the Source Code¶
You can find the code at example-video-qa.
Most of the Executors used in this tutorial are available on Jina Hub:
In this example, we rely on subtitles embedded in the video. For videos without subtitles, we need to build Executors using STT models to extract speech information. If the video contains other sounds, you can resort to VADSpeechSegmenter for separating speech beforehand.
Another direction to extend this example is to consider the videos’ other text information. While subtitles contain rich information about the video, not all text information is included in subtitles. A lot of videos have text information embedded in images. In such cases, we need to rely on OCR models to extract text information from the video frames.
Overall, searching in-video content is a complex task and Jina makes it a lot easier.