Multimodal Video QA — Siddhesh More

A system that answers natural language questions about video content by combining visual frame retrieval, audio transcription, and a fine-tuned vision-language model.

Frame extraction uses PyAV and PySceneDetect to sample keyframes at scene boundaries. Each frame gets a CLIP ViT-L/14 embedding computed locally. The audio track is transcribed by Whisper with word-level timestamps, and the transcript is chunked and embedded with bge-large-en-v1.5. Two FAISS indexes sit side by side: one visual (IndexFlatIP over L2-normalized CLIP embeddings, so inner product equals cosine similarity) and one textual. At query time each index retrieves independently, and Reciprocal Rank Fusion (RRF) merges the two ranked lists. Because RRF uses only rank positions, it avoids the score normalization problem that comes from mixing cosine similarities from two different embedding spaces.
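The fusion step can be sketched in a few lines of plain Python. This is a minimal illustration rather than the project's actual code: the function name, the segment IDs, and the constant k = 60 (a commonly used default for RRF) are assumptions. It also assumes that frame hits and transcript hits have already been mapped to a shared segment ID, for example a time window of the video.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse several best-first ranked lists using only rank positions."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1; each list contributes 1 / (k + rank) per item.
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first. Raw similarity scores are never compared,
    # so the two embedding spaces do not need a common scale.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical segment IDs standing in for results from the two indexes.
visual_hits = ["seg_12", "seg_03", "seg_07"]  # from the CLIP index
text_hits = ["seg_03", "seg_15", "seg_12"]    # from the transcript index

fused = reciprocal_rank_fusion([visual_hits, text_hits])
print(fused)
```

Segments that appear in both lists accumulate two contributions, so they rise above segments that only one modality retrieved, regardless of how large the underlying similarity scores were.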

The VLM (Qwen2-VL-7B) is fine-tuned with QLoRA: 4-bit NF4 quantization via bitsandbytes, with LoRA adapters at rank 16 targeting the q_proj and v_proj attention projections. Only about 0.1% of parameters are trained; the base model stays frozen. The Hugging Face Trainer handles the training loop, and W&B logs loss curves and validation QA accuracy.
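A back-of-the-envelope count shows why the trainable fraction is so small. The sketch below is arithmetic only, not the training code. The decoder dimensions (hidden size 3584, 28 layers, a 512-wide value projection under grouped-query attention) and the rounded total of 8 billion parameters are assumptions that should be checked against the model's config.json. It also assumes the adapters attach only to the language-model decoder layers.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds beside one frozen d_out x d_in weight:
    matrix A is (rank x d_in) and matrix B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed dimensions, to be verified against the model's config.json.
HIDDEN = 3584          # decoder hidden size (q_proj maps HIDDEN -> HIDDEN)
KV_DIM = 512           # v_proj output width under grouped-query attention
NUM_LAYERS = 28        # decoder layers carrying adapters
RANK = 16              # LoRA rank from the description above
ASSUMED_TOTAL = 8e9    # rounded placeholder for total model size, not measured

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
trainable = per_layer * NUM_LAYERS
fraction = trainable / ASSUMED_TOTAL

print(f"{trainable:,} adapter parameters, {fraction:.3%} of the assumed total")
```

Under these assumptions the adapters come to roughly five million weights, a fraction that sits below but on the same order as the 0.1% figure quoted above. Targeting more projections or raising the rank scales the count linearly.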

The fine-tuned model is served with vLLM. The CLIP encoder is exported to ONNX for edge inference benchmarking (a latency comparison of BF16 vs INT4). An MCP tool server exposes video_qa(query, video_id), registered with the @mcp.tool() decorator, so it can be called directly from the LangGraph agent in Project 1.
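A latency comparison of this kind can be run with a small standard-library harness. The following is a sketch under stated assumptions, not the project's benchmark: the function names, the warmup and run counts, and the stand-in workloads are all hypothetical. In real use each callable would wrap one inference call on one exported encoder variant.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 5, runs: int = 50) -> Dict[str, float]:
    """Call fn repeatedly and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warmup calls are discarded so one-time setup cost is excluded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def make_stand_in(work: int) -> Callable[[], int]:
    """Hypothetical placeholder workload; replace with a real encoder call."""
    return lambda: sum(i * i for i in range(work))


variants = {"variant_a": make_stand_in(20_000), "variant_b": make_stand_in(5_000)}
results = {name: benchmark(fn) for name, fn in variants.items()}

for name, stats in results.items():
    print(f"{name}: median {stats['median_ms']:.3f} ms, p95 {stats['p95_ms']:.3f} ms")
```

Reporting the median alongside a tail percentile matters on edge hardware, where thermal throttling and background load can make the mean misleading.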

In progress.