Sasidhar Chintapalli

I build systems that solve messy data problems.

Seventeen years of designing distributed data platforms, AI-powered applications, and the unglamorous plumbing in between. This site isn't a résumé — it's a set of problems I picked up, how I thought about them, and what I'd do differently now.

Case Studies

Three problems, framed the way I actually think about them: what was broken, what I tried, what shipped, and what I'd do differently if I started again tomorrow.

CASE 01 Current

An AI agent that discovers relationships across messy enterprise schemas

Because context is what makes LLMs useful on your data — and context lives in the joins.

[Diagram: three source tables, customers (id, name, email, region, tier, signup_dt), orders (id, cust_ref, amount, order_date), and products (sku, name, price, category, vendor), feed a LangGraph agent (schema walker, FK inference, semantic edge proposer) that walks the metadata and produces a knowledge graph of relationships, e.g. customers → orders (cust_ref = id) and orders → products (sku). The graph is used downstream for context assembly and GraphRAG retrieval.]
agent reads the schema, returns the graph you wish someone had documented

The Problem

Enterprise data is scattered across dozens of schemas with inconsistent naming, missing foreign keys, and tribal knowledge living in the heads of people who left the company two reorgs ago. To answer any non-trivial question with an LLM, you first need to know which tables belong together and how — otherwise you're stuffing irrelevant context into a prompt and praying. Manually mapping this is brittle, doesn't scale, and rots the moment a schema changes.

Approach

I considered three options. One: one-shot the whole catalog into a frontier LLM and ask it for relationships. Rejected, because real-world catalogs overflow the context window and the model hallucinates joins that don't exist. Two: pure vector similarity over column names and sample values. Rejected, because semantic similarity ≠ an actual join key, and precision suffers. Three: an agentic walker that combines metadata signals (column names, types, cardinality, sample overlap) with LLM reasoning on narrow, bounded slices. This is what I picked, because it lets the model make scoped inferences without drowning in context.
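
To make the "metadata signals" idea concrete, here is a minimal sketch of a heuristic pre-filter that scores a candidate join before any LLM call. Everything in it is an assumption for illustration: the `Column` shape, the `distinct_ratio` statistic, and the 0.6/0.4 weights are hypothetical and not taken from the actual system.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Column:
    """Hypothetical catalog entry for one column."""
    table: str
    name: str
    dtype: str
    distinct_ratio: float  # distinct values / row count, from catalog stats


def name_similarity(a: str, b: str) -> float:
    """Cheap string similarity in [0, 1]; a stand-in for smarter matching."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_score(child: Column, parent: Column) -> float:
    """Score how plausible it is that `child` references `parent`."""
    if child.dtype != parent.dtype:
        return 0.0  # mismatched types rarely join cleanly
    # A foreign key often embeds the parent table name, e.g. customer_id.
    name_signal = max(
        name_similarity(child.name, parent.name),
        name_similarity(child.name, f"{parent.table}_{parent.name}"),
    )
    # The referenced side should look like a key: close to fully distinct.
    key_signal = parent.distinct_ratio
    # Illustrative weights only; a real system would tune these.
    return 0.6 * name_signal + 0.4 * key_signal
```

Only pairs that clear some threshold would go on to the more expensive LLM scoring step, which keeps each prompt small.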

Solution

A LangGraph-based agent orchestration with specialized nodes: a schema walker that crawls the metadata catalog, a candidate proposer that flags potential joins using heuristics plus LLM scoring, a validator that checks sample overlap before committing, and a graph writer that emits edges into a knowledge graph. The output feeds a GraphRAG retrieval layer that downstream agents use for context assembly when answering questions on enterprise data.
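
The validator and graph-writer steps can be sketched in plain Python, leaving the LangGraph wiring aside. This is a sketch under stated assumptions: the `containment` measure, the 0.95 threshold, and the edge dictionary shape are hypothetical choices, not the system's actual ones.

```python
def containment(child_values, parent_values) -> float:
    """Fraction of distinct non-null child values that also appear in the parent."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0
    return len(child & set(parent_values)) / len(child)


def validate_and_emit(candidates, samples, min_containment=0.95):
    """Keep candidates whose sampled values overlap enough; return graph edges."""
    edges = []
    for child_col, parent_col in candidates:
        score = containment(samples[child_col], samples[parent_col])
        if score >= min_containment:
            edges.append({"from": child_col, "to": parent_col, "containment": score})
    return edges


# Sampled values per column (made-up data for illustration).
samples = {
    "orders.cust_ref": [1, 2, 2, 3, None],
    "customers.id": [1, 2, 3, 4],
    "products.sku": ["A1", "B2"],
}
candidates = [
    ("orders.cust_ref", "customers.id"),
    ("orders.cust_ref", "products.sku"),
]
edges = validate_and_emit(candidates, samples)
# Only orders.cust_ref -> customers.id survives; every sampled cust_ref exists in customers.id.
```

Checking overlap before committing an edge is what keeps a plausible-sounding but wrong join out of the graph.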

Outcome

Currently in active development as part of the AI-Native Data Layer I'm building at Uniphore. Early results on internal benchmarks show the discovered graph meaningfully improves retrieval precision over naive semantic search, but I want to be honest: it's too early for production numbers I'd stand behind publicly. I'll update this section when there's something real to report.

4 Agent types
5 Platform layers
“production numbers coming”

What I'd Do Differently

Start with an evaluation harness before building the agent. I spent real time tuning heuristics without a rigorous way to measure regressions, and that always comes back to bite you. I'd also keep the first version simpler: a deterministic walker with LLM scoring only at the edges, rather than full agentic orchestration. Agents are fun but they obscure failure modes that a boring pipeline would surface immediately.
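
The core of such an evaluation harness can be very small. A minimal sketch, assuming a hand-labeled set of known-good join edges exists (the edge tuple format here is hypothetical):

```python
def edge_metrics(predicted, gold):
    """Precision, recall, and F1 of predicted join edges against a labeled set."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {
    ("orders.cust_ref", "customers.id"),
    ("orders.sku", "products.sku"),
}
predicted = {
    ("orders.cust_ref", "customers.id"),
    ("orders.amount", "products.price"),  # a hallucinated join
}
metrics = edge_metrics(predicted, gold)
# One of two predictions is right and one of two gold edges is found.
```

Running this on every heuristic change turns "it feels better" into a number that can regress visibly.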

LangGraph · GraphRAG · Agentic AI · Knowledge Graph · Schema Inference · LLMs

CASE 02 Recent Past

Automating migration of thousands of legacy ETL workloads to cloud warehouses

The enterprise equivalent of translating a library from one dead language to three living ones.

[Diagram: legacy sources (Informatica, legacy SQL, Hive / Oozie) → ANTLR4 reader (grammar → AST) → middleware JSON IR (portable, versioned; the hub of the hub-and-spoke design) → target-specific writers → targets (Snowflake, Databricks SQL).]
one reader per legacy system, one writer per target — the JSON in the middle is the whole trick

The Problem

A Fortune 500 client had years of business logic locked in Informatica mappings, legacy SQL jobs, and an older on-prem data platform. They needed it all on Snowflake and Databricks SQL — not in five years, but on a realistic migration timeline. Hand-rewriting thousands of pipelines was economically impossible and guaranteed to drop edge cases. The real constraint wasn't just automation — it was building something that could target multiple cloud warehouses without writing N×M translators.

Approach

Obvious option: point-to-point translators (Informatica→Snowflake, Informatica→Databricks, Hive→Snowflake…). Rejected: that's the classic N×M trap, where every new source needs a translator for every target and vice versa. I designed a hub-and-spoke architecture with a middleware JSON intermediate representation. Readers parse the source into the IR using ANTLR4 grammars; writers emit target-specific SQL/config from the same IR. Adding a new source or target then costs exactly one new reader or writer, so total work grows as N+M instead of N×M.
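
To make the translator counts concrete, here is a quick arithmetic sketch. It assumes three legacy sources and four targets, roughly matching the systems named in this case study; the exact counts are illustrative.

```python
def translators_needed(num_sources: int, num_targets: int) -> dict:
    """Components to build under each architecture."""
    return {
        "point_to_point": num_sources * num_targets,  # one translator per pair
        "hub_and_spoke": num_sources + num_targets,   # one reader or writer each
    }


before = translators_needed(3, 4)  # 12 point-to-point vs 7 hub-and-spoke
after = translators_needed(4, 4)   # add one more legacy source

# Marginal cost of the new source under each design.
extra_point_to_point = after["point_to_point"] - before["point_to_point"]  # 4
extra_hub_and_spoke = after["hub_and_spoke"] - before["hub_and_spoke"]     # 1
```

With point-to-point translators, a new source costs one translator per existing target. With the hub, it costs a single reader no matter how many targets exist.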

Solution

ANTLR4-based parsers for each legacy source, producing an AST walked by a reader into a versioned middleware JSON schema. The IR captured not just SQL but transformation semantics, lineage, scheduling metadata, and parameterization. Writers for each target consumed the IR and emitted native SQL plus orchestration config. A validation layer diffed source vs. translated outputs on sample data to catch semantic drift before customers ran anything in production.
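
A versioned IR of this kind might look roughly like the sketch below. The field names, the version string, and the toy `render` writer are all hypothetical; the real schema is not described in enough detail here to reproduce, and real writers would emit dialect-specific SQL rather than a comment header.

```python
import json
from dataclasses import asdict, dataclass, field

IR_VERSION = "1.0"  # hypothetical version string


@dataclass
class PipelineIR:
    """Illustrative hub representation of one translated pipeline."""
    name: str
    sql: str
    sources: list = field(default_factory=list)      # lineage: upstream tables
    schedule: str = ""                               # e.g. a cron expression
    parameters: dict = field(default_factory=dict)   # runtime parameters
    ir_version: str = IR_VERSION


def to_json(ir: PipelineIR) -> str:
    """Serialize the IR so any reader can hand it to any writer."""
    return json.dumps(asdict(ir), sort_keys=True)


def from_json(payload: str) -> PipelineIR:
    """Load an IR document, refusing versions this writer does not understand."""
    data = json.loads(payload)
    if data.get("ir_version") != IR_VERSION:
        raise ValueError(f"unsupported IR version: {data.get('ir_version')!r}")
    return PipelineIR(**data)


def render(ir: PipelineIR, target: str) -> str:
    """Toy writer: every target consumes the same IR."""
    return f"-- target: {target}\n-- pipeline: {ir.name}\n{ir.sql}"
```

Checking the version at load time is one simple way to fail fast when a writer meets an IR shape it was not built for.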

Outcome

The framework onboarded Fortune 500 clients running thousands of pipelines per day on the new cloud targets, with the same architecture supporting Snowflake, Databricks, BigQuery, and EMR without separate codebases.

1000s Pipelines translated
4 Target platforms
N+M Instead of N×M

What I'd Do Differently

I'd invest earlier in a differential testing harness that auto-generates fixtures from source pipelines and validates writer output on real data samples. We built this eventually, but running without it in the early phases meant discovering semantic mismatches much later than we should have. I'd also version the middleware IR more aggressively from day one — schema evolution on the hub is painful when you've got downstream writers depending on older shapes.
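
The comparison step of a differential testing harness can be sketched as an order-insensitive diff of two result sets. This assumes rows arrive as sequences of hashable values; fetching the rows from the source and target systems is out of scope here.

```python
from collections import Counter


def diff_results(source_rows, translated_rows):
    """Order-insensitive comparison of two result sets that respects duplicates."""
    src = Counter(map(tuple, source_rows))
    dst = Counter(map(tuple, translated_rows))
    return {
        "missing": list((src - dst).elements()),     # in source, not in translation
        "unexpected": list((dst - src).elements()),  # in translation, not in source
    }


source_rows = [(1, "a"), (2, "b"), (2, "b")]
translated_rows = [(2, "b"), (1, "a"), (3, "c")]
report = diff_results(source_rows, translated_rows)
# The translation dropped one duplicate (2, "b") row and produced an extra (3, "c").
```

Using a multiset rather than a plain set matters: a translated pipeline that silently deduplicates rows would otherwise pass.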

ANTLR4 · Grammar Parsing · Snowflake · Databricks · IR Design · ETL Migration

CASE 03 Side Project

Kooklive: a live-streaming platform for home cooks teaching classes during the pandemic

Built nights and weekends when half the world was suddenly learning to cook.

[Diagram: hosted on Google Cloud. Android app (learners and cooks: discover, join, chat) → Node + Express backend with Python workers (auth, sessions, payments, scheduling) → MongoDB. Firebase handles real-time chat, and live video sessions go directly from the client to the Zoom SDK.]
classic client/server, but video and chat handed off to managed services — so I could sleep

The Problem

Pandemic lockdowns meant millions of people suddenly wanted to cook at home, and a lot of talented home cooks had time on their hands and something to teach. YouTube recipes aren't classes — there's no interaction, no scheduling, no way for a social cook to run a paid class with their own students. I wanted to build a dedicated platform where home cooks could monetize live cooking lessons. The constraint: it had to ship fast (the window was open now) and I was building it nights and weekends.

Approach

The tempting path was to build a WebRTC video stack from scratch. Rejected immediately — that's a full-time job for a team, not a side project. Same with building a real-time chat layer on raw WebSockets. The right move for a time-constrained MVP was to pick best-in-class managed services for the hard parts and focus my time on the domain-specific logic: class discovery, scheduling, host onboarding, and the mobile experience. I chose Zoom SDK for video and Firebase for chat because both were battle-tested and I could integrate them in days, not months.

Solution

Node.js + Express backend on Google Cloud with MongoDB for the application data, Python workers for background jobs like notifications and scheduling, an Android app as the primary client, Zoom SDK embedded for live class sessions, and Firebase for real-time chat during classes. Auth and session handling lived in the backend, but the heavy lifting (video frames, chat fan-out) stayed with the managed services.
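
As an illustration of the kind of background job a Python worker might run, here is a sketch of selecting classes that need a "starting soon" reminder. It is entirely hypothetical: the record fields (`id`, `starts_at`, `reminded`) and the 30-minute lead time are assumptions, and the database query and the notification send are left out.

```python
from datetime import datetime, timedelta, timezone


def due_reminders(classes, now, lead=timedelta(minutes=30)):
    """Ids of classes starting within `lead` of `now` that have not been notified."""
    return [
        c["id"]
        for c in classes
        if not c["reminded"] and now <= c["starts_at"] <= now + lead
    ]


now = datetime(2021, 1, 1, 18, 0, tzinfo=timezone.utc)
classes = [
    {"id": "pasta-101", "starts_at": now + timedelta(minutes=20), "reminded": False},
    {"id": "knife-skills", "starts_at": now + timedelta(minutes=90), "reminded": False},
    {"id": "sourdough", "starts_at": now + timedelta(minutes=10), "reminded": True},
]
due = due_reminders(classes, now)
# Only pasta-101 is both inside the window and not yet reminded.
```

Keeping the selection logic as a pure function makes it easy to test without a database or a push-notification service.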

Outcome

Kooklive shipped as a working MVP with real cooks running live classes. It was a side project rather than a funded startup, so I won't pretend we hit massive scale — the honest version is that it proved the product hypothesis end-to-end, gave me a ton of validated learning about managed service composition, and became a nice demo of what one person can ship on evenings and weekends when the primitives are chosen well.

1 Engineer (me)
MVP Shipped end-to-end
“honest: side project scale”

What I'd Do Differently

I'd start on the web instead of Android-first. Mobile felt right for learners but Android-only cut my addressable audience in half and an iOS build from a solo dev was never going to happen. A responsive web client would have given me both platforms for the price of one. I'd also validate the monetization loop before building scheduling — I built the features I thought a class platform needed, rather than the features that proved people would actually pay.

Node.js · Express · MongoDB · Android · Zoom SDK · Firebase · GCP

Blog

Things I've figured out and wanted to write down — mostly about AI, SQL parsing, and the occasional deep dive.

Follow on Medium