Hi, I'm Siddhesh
I work at the intersection of AI research and engineering.
MS student at ASU building RAG systems, agentic pipelines, and products people actually use.
About
The question I keep coming back to: how do you make a language model less confidently wrong? I work on that as a Research Aide at ASU (4.0 GPA), and I also ship products that 3,000+ people use.
The research side covers cross-model uncertainty quantification, dense retrieval that stays robust on multi-hop queries, and Text-to-SQL systems that beat GPT-4 using local models. Three papers so far, one published.
The builder side: Referrlyy has 2,200+ users finding job referrals. BoltPrep has 1,000+ prepping for interviews. Won two hackathons building for nonprofits in under 36 hours. If it can be shipped, I will ship it.
Currently in Tempe, AZ, where the sun is a personal enemy and I still wear a hoodie indoors. Always reading something, always building something.
Work Experience

Architected a production RAG pipeline with LangChain and FAISS on AWS Lambda, serving 500+ concurrent academic queries daily with zero downtime via GitHub Actions CI/CD.
Engineered a Redis caching layer with PostgreSQL connection pooling that cut query latency 60% (800ms to 320ms), now the system's primary performance lever.
Built FastAPI microservices with stateful chat-history retrieval, enabling persistent conversation context across sessions for an AI academic assistant.

Shipped a Flutter and Supabase mobile app from zero to 300+ rural Indian users, with OAuth 2.0 (Google Sign-In), real-time sync, and offline-first UX.
Built full multi-language support (English, Hindi, Marathi) and voice navigation using Flutter TTS/STT, making the app usable without English literacy.
Redesigned the onboarding flow, cutting completion time from 8 to 5 minutes for users unfamiliar with smartphones.

Built RESTful APIs for an ERP system using Django REST Framework, sustaining 50K+ daily requests at sub-100ms average latency.
Architected a Redis caching and Celery async task queue layer, decoupling CPU-heavy report jobs from the main request cycle.
Diagnosed N+1 query patterns in PostgreSQL and added composite indices, cutting report generation from 45 seconds to 6 seconds.
Skills
Languages
Frameworks
ML / AI
Infrastructure
Selected Work
Research tools, shipped products, and hackathon builds.
Publications
Three papers on making language models more reliable.
Cross-Model Semantic Entropy: Uncertainty-Aware Aggregation of Heterogeneous LLMs for Factual Question Answering
Different LLMs hallucinate on different questions. A model that fails on obscure science may be reliable on history. We introduce CMSE, a training-free method that quantifies this cross-model disagreement. The selective cascade variant S-CMSE beats majority voting by +4.6pp on TruthfulQA while calling only 2.0 models per query on average. On contested questions where models genuinely disagree (top 15.7% by entropy), CMSE wins by +12.5pp. Uncertainty AUROC = 0.765.
VLDB Workshop 2026
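To make the idea of cross-model disagreement concrete, here is a minimal sketch that scores a question by the Shannon entropy of the answers several models return. This is not the CMSE method from the paper: the names `normalize`, `cross_model_entropy`, and `majority_answer` are hypothetical, and plain string normalization stands in for whatever semantic clustering the real system uses.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Assumption: lowercasing and collapsing whitespace stands in for semantic clustering."""
    return " ".join(answer.lower().split())


def cross_model_entropy(answers: list[str]) -> float:
    """Shannon entropy (in bits) over the clusters of answers from several models."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def majority_answer(answers: list[str]) -> str:
    """Majority-vote baseline: the most common normalized answer."""
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][0]
```

In this toy version, entropy is zero when every model agrees and grows as the answers split, so a high score marks the kind of contested question on which uncertainty-aware aggregation could plausibly differ from majority voting.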
Beyond HyDE: Cross-Model Hypothesis Diversity for Robust Dense Retrieval
HyDE degrades multi-hop retrieval by up to 7.7pp on HotPotQA. We propose CMHA, which aggregates hypotheses across N diverse LLMs, recovering the degradation and reaching R@10 = 0.8242 on NQ. We also report the first empirical evidence that chain-of-thought thinking models catastrophically break HyDE (up to 68pp degradation on HotPotQA). A diversity-based adaptive routing policy achieves 99.9% of full CMHA recall at 3.8 average LLM calls per query.
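One simple way to picture aggregating hypotheses across models is to embed each model's hypothetical answer passage, average the vectors, and retrieve with the averaged vector. The sketch below assumes exactly that and should not be read as the CMHA algorithm: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real dense encoder, and `aggregate_hypotheses` and `rank` are illustrative names.

```python
import math
import zlib

DIM = 64  # toy embedding size, an arbitrary choice for this sketch


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized (assumption: replaces a real encoder)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def aggregate_hypotheses(hypotheses: list[str]) -> list[float]:
    """Mean of the per-hypothesis embeddings, re-normalized to unit length."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    embedded = [toy_embed(h) for h in hypotheses]
    mean = [sum(col) / len(embedded) for col in zip(*embedded)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean] if norm else mean


def rank(query_vec: list[float], docs: list[str]) -> list[str]:
    """Order documents by dot product with the query vector, highest first."""
    scored = [(sum(q * d for q, d in zip(query_vec, toy_embed(doc))), doc) for doc in docs]
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

In use, each string passed to `aggregate_hypotheses` would be a hypothetical passage generated by a different LLM for the same query. Averaging dilutes any single model's off-topic hypothesis, which is one plausible intuition for why diversity across models could help on multi-hop queries.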
Isolating Architectural Effects in Text-to-SQL: A Controlled Diagnostic Study Using the Spider Sandbox
Text-to-SQL evaluations routinely co-optimize model, prompt, retrieval, and generation strategy, making it impossible to know whether a result reflects the model or the engineering. We hold every component of ADAPT-SQL fixed and vary only the model, across three architecturally diverse models. The key finding: models score within 0.8pp of each other on 89% of Spider queries regardless of parameter scale. The gap opens only on nested-complex queries, where MoE routing unlocks a +4.4pp advantage for Qwen3-235B (22B active) over Gemma4-31B (31B active, dense), an ordering inconsistent with active-parameter count alone. 91.0% execution accuracy on Spider 1.0, competitive with fine-tuned systems.
