Open to opportunities

Hi, I'm Siddhesh 👋

I work at the intersection of AI research and engineering.

MS student at ASU building RAG systems, agentic pipelines, and products people actually use.


About

The question I keep coming back to: how do you make a language model less confidently wrong? I work on that as a Research Aide at ASU (4.0 GPA), and I also ship products that 3,000+ people use.

The research side covers cross-model uncertainty quantification, dense retrieval that stays robust on multi-hop queries, and Text-to-SQL systems that beat GPT-4 using local models. Three papers so far, one published.

The builder side: Referrlyy has 2,200+ users finding job referrals. BoltPrep has 1,000+ prepping for interviews. Won two hackathons building for nonprofits in under 36 hours. If it can be shipped, I will ship it.

Currently in Tempe, AZ, where the sun is a personal enemy and I still wear a hoodie indoors. Always reading something, always building something.

GitHub contribution graph

Work Experience

Arizona State University
CIS Research Aide
Dec 2025 – Present
  • Architected a production RAG pipeline with LangChain and FAISS on AWS Lambda, serving 500+ concurrent academic queries daily with zero downtime via GitHub Actions CI/CD.

  • Engineered a Redis caching layer with PostgreSQL connection pooling that cut query latency 60% (800ms to 320ms), now the system's primary performance lever.

  • Built FastAPI microservices with stateful chat-history retrieval, enabling persistent conversation context across sessions for an AI academic assistant.

Ayuarogya Saukhyam Foundation
Software Development Intern
Sept 2024 – Mar 2025
  • Shipped a Flutter and Supabase mobile app from zero to 300+ rural Indian users, with OAuth 2.0 (Google Sign-In), real-time sync, and offline-first UX.

  • Built full multi-language support (English, Hindi, Marathi) and voice navigation using Flutter TTS/STT, making the app usable without English literacy.

  • Redesigned the onboarding flow, cutting completion time from 8 to 5 minutes for users unfamiliar with smartphones.

Digibranders Private Limited
Python Development Intern
Jun 2023 – Aug 2023
  • Built RESTful APIs for an ERP system using Django REST Framework, sustaining 50K+ daily requests at sub-100ms average latency.

  • Architected a Redis caching and Celery async task queue layer, decoupling CPU-heavy report jobs from the main request cycle.

  • Diagnosed N+1 query patterns in PostgreSQL and added composite indices, cutting report generation from 45 seconds to 6 seconds.

Skills

Languages

Python
TypeScript
C/C++
SQL

Frameworks

React
Next.js
FastAPI
Flutter
Django
Node.js

ML / AI

PyTorch
LangChain
LangGraph
HuggingFace
FAISS
ONNX
RAG Systems
Agentic AI
LLM Fine-tuning

Infrastructure

PostgreSQL
Redis
Docker
AWS
MLflow

Selected Work

Research tools, shipped products, and hackathon builds.

01
CaseTrack
WiCS x OpHack 2026

Won WiCS x OpHack 2026

02
Waste2Wealth
LA Hacks 2026

Won LA Hacks 2026

03

93.7% accuracy on Spider 1.0, beating GPT-4 using local LLMs

04

2,200+ installs · #1 Product of the Week

05

1,000+ users · #2 Product of the Week


Publications

Three papers on making language models more reliable.

Cross-Model Semantic Entropy: Uncertainty-Aware Aggregation of Heterogeneous LLMs for Factual Question Answering

Different LLMs hallucinate on different questions. A model that fails on obscure science may be reliable on history. We introduce CMSE, a training-free method that quantifies this cross-model disagreement. The selective cascade variant S-CMSE beats majority voting by +4.6pp on TruthfulQA while calling only 2.0 models per query on average. On contested questions where models genuinely disagree (top 15.7% by entropy), CMSE wins by +12.5pp. Uncertainty AUROC = 0.765.

VLDB Workshop 2026
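To give a feel for what "cross-model disagreement" can mean, here is a minimal sketch, not the CMSE method from the paper. It assumes one answer per model, groups answers with a crude string normalization (a stand-in for real semantic clustering), and scores disagreement as Shannon entropy over those groups. The function names `normalize`, `cross_model_entropy`, and `aggregate` are hypothetical.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Crude stand-in for semantic clustering: lowercase, drop punctuation,
    collapse whitespace. A real system would cluster by meaning instead."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def cross_model_entropy(answers: list[str]) -> float:
    """Shannon entropy in bits over answer groups, one answer per model.
    0.0 means every model agrees; higher values mean more disagreement."""
    counts = Counter(normalize(a) for a in answers)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def aggregate(answers: list[str]) -> tuple[str, float]:
    """Return the most common normalized answer and the disagreement score."""
    counts = Counter(normalize(a) for a in answers)
    best, _ = counts.most_common(1)[0]
    return best, cross_model_entropy(answers)
```

With three models answering "Paris", "paris", and "Lyon", `aggregate` picks "paris" and reports an entropy of roughly 0.92 bits. A caller could treat a high score as a signal to abstain or to consult more models, which is the general intuition behind selective cascades.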

Beyond HyDE: Cross-Model Hypothesis Diversity for Robust Dense Retrieval

HyDE degrades multi-hop retrieval by up to 7.7pp on HotPotQA. We propose CMHA, which aggregates hypotheses across N diverse LLMs, recovering the degradation and reaching R@10 = 0.8242 on NQ. We also report the first empirical evidence that chain-of-thought thinking models catastrophically break HyDE (up to 68pp degradation on HotPotQA). A diversity-based adaptive routing policy achieves 99.9% of full CMHA recall at 3.8 average LLM calls per query.
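The general shape of aggregating hypotheses from several LLMs can be sketched as follows. This is an illustration under loud assumptions, not the CMHA method from the paper: the hypotheses are passed in as plain strings (in practice each would come from a different LLM), the embedding is a toy bag-of-words vector rather than a dense encoder, and the aggregation is a simple mean of the hypothesis vectors. All function names are hypothetical.

```python
import math
from collections import Counter


def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy embedding: L2-normalized bag-of-words counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean, the assumed aggregation step across hypotheses."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(hypotheses: list[str], corpus: list[str], k: int = 1) -> list[str]:
    """Rank corpus documents against the mean of the hypothesis embeddings."""
    vocab = sorted({w for doc in corpus + hypotheses for w in doc.lower().split()})
    query_vec = mean_vector([embed(h, vocab) for h in hypotheses])
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(query_vec, embed(doc, vocab)),
        reverse=True,
    )
    return ranked[:k]
```

The intuition is that averaging over hypotheses from different models dilutes any single model's hallucinated detail before it reaches the retriever. Swapping `embed` for a real dense encoder and the sorted scan for a vector index would be the obvious next step.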

Isolating Architectural Effects in Text-to-SQL: A Controlled Diagnostic Study Using the Spider Sandbox

Text-to-SQL evaluations routinely co-optimize model, prompt, retrieval, and generation strategy simultaneously, making it impossible to know whether a result reflects the model or the engineering. We fix every component of ADAPT-SQL across three architecturally diverse models. The key finding: models score within 0.8pp on 89% of Spider queries regardless of parameter scale. The gap opens only on nested-complex queries, where MoE routing unlocks a +4.4pp advantage for Qwen3-235B (22B active) over Gemma4-31B (31B active, dense), an ordering inconsistent with active-parameter count alone. 91.0% execution accuracy on Spider 1.0, competitive with fine-tuned systems.