I build pipelines that don't break and AI systems that ship. From clinical knowledge graphs to multi-agent platforms in production — I care about what happens after the demo.
I grew up in Gujarat, India, and was class president four years straight at Dharmsinh Desai University. Not because it looked good on paper. Because I'm wired to build structure where there isn't any.
I moved to Boston in 2024 to do my Master's at Northeastern. Within a year I was doing two jobs at once under the same professor: building healthcare AI pipelines in production on GCP, and designing two graduate data engineering courses from scratch for 80+ students.
My philosophy is simple. I build systems that work after the demo ends. Every project has validation, observability, and a clear answer to "what happens when this breaks at 2am?" That's the line between data engineering and data science theater.
I graduate May 2026. I'm in an intensive job search right now — looking for a team that takes data infrastructure or applied AI seriously. If that's you, let's talk.
Just completed — a clinical AI reasoning engine for rare disease differential diagnosis. Not a chatbot. A diagnostic workflow engine.
A symptom-to-rare-disease reasoning engine. Enter symptoms, traverse the disease knowledge graph, surface Orpha-coded candidates with targeted follow-up questions.
Snowflake-native analytics with KG_NODES / KG_EDGES / KG_BUILD_META tables, Snowflake Cortex LLM functions, dbt marts, and an agentic LangGraph assessment path. Encodes domain expertise over 8,700+ HPO ontology terms across 9 authoritative medical sources — Orphanet, PubMed, PMC, OpenFDA, RxNorm, WHO, NCBI Bookshelf, OpenStax, and an inverted symptom index.
Three-tier evidence retrieval: local warehouse (ChromaDB semantic + Snowflake ILIKE) → live APIs (NCBI E-utilities, Europe PMC) → constrained web fallback. Evidence-quality gate refuses to fabricate and returns insufficient_evidence + follow-up questions when the signal is weak. Streamlit app, FastAPI backend, Airflow DAGs, Prometheus metrics, MLflow tracking, optional self-critique pass.
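Here's the shape of that fallback and gate, as a minimal sketch. The tier functions, thresholds, and follow-up wording are illustrative stand-ins, not the production code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evidence:
    source: str
    text: str
    score: float  # retrieval confidence in [0, 1]

MIN_SCORE, MIN_HITS = 0.6, 3  # assumed gate thresholds

def retrieve(symptoms: list[str],
             tiers: list[Callable[[list[str]], list[Evidence]]]) -> dict:
    """Walk the tiers in order; answer only when a tier returns enough strong
    evidence, otherwise refuse and ask follow-up questions instead."""
    for tier in tiers:
        hits = [e for e in tier(symptoms) if e.score >= MIN_SCORE]
        if len(hits) >= MIN_HITS:
            return {"status": "ok", "evidence": hits}
    return {"status": "insufficient_evidence",
            "follow_up_questions": [f"How long have you had {s}?" for s in symptoms]}

# Stub tiers standing in for ChromaDB/Snowflake, NCBI/Europe PMC, and the web fallback.
def local_warehouse(symptoms): return [Evidence("orphanet", "excerpt", 0.9)]
def live_apis(symptoms): return []
def web_fallback(symptoms): return []

print(retrieve(["ptosis", "fatigable weakness"], [local_warehouse, live_apis, web_fallback]))
```

The gate is the point: one weak hit from the warehouse is not enough to assert a diagnosis, so the engine asks for more signal instead of guessing.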
Filter by track or scroll for everything. The ones on the left are bigger for a reason.
Agentic LLM platform for private-equity-grade due diligence on the Forbes AI 50.
Airflow DAGs ingest 500+ documents per run into GCS, chunked at 800 tokens / 100 overlap into 26,985 vectors (384-dim, all-MiniLM-L6-v2) in Qdrant. A 5-node LangGraph workflow — Planner → Data Generator → Evaluator → Risk Detector → HITL — uses the ReAct pattern with full execution traces. An MCP server exposes the intelligence as tools. Side-by-side RAG vs Structured (Pydantic) pipelines scored head-to-head at 8.17/10.
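The ingest step boils down to this (a sketch, not the DAG itself; the collection name, file path, and Qdrant endpoint are placeholders, and the word-window chunker approximates the real token-based 800/100 split):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

CHUNK, OVERLAP = 800, 100

def chunk_text(text: str) -> list[str]:
    # Sliding word window; the pipeline proper counts tokens, not words.
    words = text.split()
    step = CHUNK - OVERLAP
    return [" ".join(words[i:i + CHUNK]) for i in range(0, len(words), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dim embeddings
client = QdrantClient(url="http://localhost:6333")     # placeholder endpoint

client.recreate_collection(
    collection_name="ai50_docs",                       # placeholder name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

chunks = chunk_text(open("filing.txt").read())         # placeholder document
client.upsert(
    collection_name="ai50_docs",
    points=[PointStruct(id=i, vector=vec.tolist(), payload={"text": c})
            for i, (vec, c) in enumerate(zip(model.encode(chunks), chunks))],
)
```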
Multi-agent intelligence for technical startup founders.
Seven specialized agents (Pitch, Competitive, Marketing, Patent, Policy, Team + Coordinator) on a shared LangGraph state. Weaviate Cloud with HNSW indexing. GPT-4 ↔ Gemini failover. MCP tools: DuckDuckGo, Pexels, USPTO, web extraction. Pitch agent integrates Gamma.ai across 5 themes. React + TypeScript + Vite frontend, FastAPI backend.
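The shared-state pattern, reduced to three of the nodes. Everything below the imports is a stub: the real agents call tools and models, and routing is the Coordinator's decision rather than a fixed edge:

```python
from typing import TypedDict
from langgraph.graph import END, StateGraph

class FounderState(TypedDict):
    idea: str
    pitch: str
    competitive: str

def coordinator(state: FounderState) -> dict:
    return {"idea": state["idea"].strip()}      # real version decides which agents run

def pitch_agent(state: FounderState) -> dict:
    return {"pitch": f"Pitch outline for: {state['idea']}"}

def competitive_agent(state: FounderState) -> dict:
    return {"competitive": f"Competitor scan for: {state['idea']}"}

graph = StateGraph(FounderState)
graph.add_node("coordinator", coordinator)
graph.add_node("pitch", pitch_agent)
graph.add_node("competitive", competitive_agent)
graph.set_entry_point("coordinator")
graph.add_edge("coordinator", "pitch")
graph.add_edge("pitch", "competitive")
graph.add_edge("competitive", END)

print(graph.compile().invoke({"idea": "AI due diligence copilot", "pitch": "", "competitive": ""}))
```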
Cloud-native RAG over a 4,000-page Financial Toolbox.
Code-aware chunking with custom separator priority (1200-char chunks, 200 overlap) — 180/180 MATLAB code blocks preserved intact, 100% metadata preservation for citations. text-embedding-3-large at 3072 dimensions, Instructor-validated Pydantic outputs, Wikipedia fallback when concepts aren't in the PDF. Cloud Composer weekly refresh at $0.0024 per query.
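The splitter itself is small; the trick is the separator order. A sketch of the idea with LangChain's recursive splitter, where the separator list and input file are assumptions, not the project's exact configuration:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Try the coarsest boundaries first, so a chunk breaks at a code-fence or
# paragraph boundary before it ever breaks mid-line or mid-code-block.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n```", "\n\n", "\n", " ", ""],   # illustrative priority order
    chunk_size=1200,
    chunk_overlap=200,
    keep_separator=True,
)

text = open("financial_toolbox_excerpt.txt").read()   # placeholder input
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks, longest {max(map(len, chunks))} chars")
```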
End-to-end MLOps for real-time sentiment at production scale.
Complete MLOps lifecycle. Logistic Regression, Random Forest, XGBoost tracked via MLflow — winning model served by FastAPI on Kubernetes. Kafka live streaming, Prometheus + Grafana for latency / throughput / drift, GitHub Actions CI/CD.
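Model selection is the part worth showing. A trimmed sketch: synthetic features stand in for the real text vectorization, XGBoost is omitted to keep it short, and the experiment name and metric are placeholders:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

mlflow.set_experiment("sentiment-models")
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("random_forest", RandomForestClassifier(n_estimators=200))]:
    with mlflow.start_run(run_name=name):
        model.fit(X_tr, y_tr)
        mlflow.log_metric("f1", f1_score(y_te, model.predict(X_te)))
        mlflow.sklearn.log_model(model, "model")

# The best run is what the FastAPI service on Kubernetes actually serves.
best = mlflow.search_runs(order_by=["metrics.f1 DESC"]).iloc[0]
print("winner:", best["tags.mlflow.runName"], round(best["metrics.f1"], 3))
```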
Star schema over 90M rows, 10× query speedup.
Fact_Title_Ratings at proper grain + 5 dimensions + 2 bridge tables. Eliminates row explosion and metric inflation on many-to-many joins. ADF + Alteryx + Snowflake + Power BI.
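The row-explosion problem in one toy example (pandas here just to make the point; the production model lives in Snowflake, and the column names are illustrative):

```python
import pandas as pd

# One title with 1,000 votes, tagged with two genres on the bridge.
fact = pd.DataFrame({"title_key": [1], "num_votes": [1000]})
bridge = pd.DataFrame({"title_key": [1, 1], "genre_key": [10, 20]})

exploded = fact.merge(bridge, on="title_key")   # many-to-many join duplicates the fact row
print(exploded["num_votes"].sum())              # 2000 -> inflated metric
print(fact["num_votes"].sum())                  # 1000 -> correct, measured at fact grain
```

Keep measures at the fact grain and use the bridge only for slicing; that's the whole reason the bridge tables exist.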
First unified Chicago + Dallas health-safety model.
Two cities. Incompatible schemas (17 cols vs 114 cols). Different risk systems — categorical labels vs numeric scores. City-aware Alteryx ETL resolves the semantic mismatch into one trustworthy fact table.
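A toy version of the semantic alignment. The real mapping lives in the city-aware Alteryx flows; the labels, score bins, and column names below are assumptions for illustration:

```python
import pandas as pd

chicago = pd.DataFrame({"name": ["A Cafe"], "risk": ["Risk 1 (High)"]})
dallas = pd.DataFrame({"name": ["B Diner"], "inspection_score": [72]})

LABEL_TO_LEVEL = {"Risk 1 (High)": 3, "Risk 2 (Medium)": 2, "Risk 3 (Low)": 1}

# Categorical labels and numeric scores both land on a shared 1-3 risk level.
chicago["risk_level"] = chicago["risk"].map(LABEL_TO_LEVEL)
dallas["risk_level"] = pd.cut(dallas["inspection_score"],
                              bins=[-1, 69, 89, 100], labels=[3, 2, 1]).astype(int)

fact = pd.concat([chicago.assign(city="Chicago"), dallas.assign(city="Dallas")],
                 ignore_index=True)[["city", "name", "risk_level"]]
print(fact)
```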
5M+ records, MERGE-based incremental refresh.
Snowflake dimensional model. Power BI trends by demographics, offense type, precinct, time. Designed for ongoing policy evaluation, not static reports.
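The refresh pattern, roughly: new batches land in a staging table, then a MERGE updates changed rows and inserts new ones instead of rebuilding the fact table. All table, column, and connection names below are placeholders:

```python
import snowflake.connector

MERGE_FACT = """
MERGE INTO fact_arrests AS t
USING stg_arrests AS s
  ON t.arrest_key = s.arrest_key
WHEN MATCHED AND t.record_hash <> s.record_hash THEN UPDATE SET
  offense_type = s.offense_type,
  precinct     = s.precinct,
  record_hash  = s.record_hash
WHEN NOT MATCHED THEN INSERT (arrest_key, offense_type, precinct, record_hash)
  VALUES (s.arrest_key, s.offense_type, s.precinct, s.record_hash);
"""

conn = snowflake.connector.connect(account="...", user="...", password="...",
                                   warehouse="ANALYTICS_WH", database="CRIME_DB")
conn.cursor().execute(MERGE_FACT)   # only rows that changed or are new get touched
```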
SEC 10-K / 10-Q parsing with dual open-source engines + cloud benchmark.
Two complementary parsers: pdfplumber for speed, Docling for layout-aware extraction with reading-order and bounding-box provenance. Optional Google Document AI benchmark. XBRL cross-checks catch scaling mismatches (e.g., "in millions"). Full DVC reproducibility with metrics tracked.
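The scaling check is simple but catches real errors. A sketch, with made-up numbers and a units list that's an assumption rather than the project's full rule set:

```python
def scale_mismatch(parsed_value: float, units_note: str, xbrl_value: float,
                   tolerance: float = 0.01) -> bool:
    """True when a table value, rescaled per its units footnote, disagrees with
    the corresponding XBRL fact by more than the tolerance."""
    scale = {"in thousands": 1e3, "in millions": 1e6, "in billions": 1e9}
    rescaled = parsed_value * scale.get(units_note.lower(), 1.0)
    return abs(rescaled - xbrl_value) / max(abs(xbrl_value), 1.0) > tolerance

# Revenue parsed from a 10-K table footnoted "in millions" vs the filed XBRL fact.
print(scale_mismatch(1_250, "in millions", 1_250_000_000))   # False: scales agree
print(scale_mismatch(1_250, "in millions", 1_250_000))       # True: off by 1000x
```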
Automated Dow-30 earnings pipeline with 3-tier Selenium scraping.
100% IR discovery across all 30 Dow Jones companies via hybrid strategy (subdomain patterns, homepage anchors, pattern guessing, DuckDuckGo fallback). Three-tier scraping: Requests → Selenium navigation → aggressive DOM manipulation. Docling parses top pages; Instructor + GPT-4 extracts 26 metadata fields per doc. Airflow to GCS.
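The first two tiers, in miniature. The URL is a placeholder, the "real page" heuristic is an assumption, and the third tier's DOM manipulation is left out:

```python
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch(url: str) -> str:
    # Tier 1: plain HTTP for static IR pages.
    try:
        resp = requests.get(url, timeout=15, headers={"User-Agent": "Mozilla/5.0"})
        if resp.ok and len(resp.text) > 2000:   # crude check that we got a real page
            return resp.text
    except requests.RequestException:
        pass
    # Tier 2: headless Selenium for JavaScript-rendered investor-relations sites.
    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

html = fetch("https://investor.example.com/quarterly-results")
```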
Four roles. Two countries. One throughline — systems that work at scale.
Research: Building production GCP pipelines (Cloud Composer, BigQuery, PySpark) processing 867 clinical records + 1,200+ EMR cases. Multi-stage LLM extraction with schema validation and PII controls — 97% structured output compliance, 85% error reduction. (Validation gate sketched below.)
Teaching: Designed Database Management (SQL/Oracle) and Data Integration courses from scratch — curriculum, labs, assignments — then delivered both to 80+ graduate students. Both roles under the same professor.
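On the research side, the gate between the LLM extraction stage and the warehouse looks roughly like this. The schema is a cut-down stand-in (pydantic v2), not the pipeline's real field set:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class ClinicalRecord(BaseModel):
    patient_id: str = Field(pattern=r"^ANON-\d+$")   # pseudonymized IDs only, no PII
    diagnosis_code: str
    age: int = Field(ge=0, le=120)
    medications: list[str] = []

def validate_llm_output(raw_json: str) -> Optional[ClinicalRecord]:
    """Reject anything off-schema before it reaches BigQuery."""
    try:
        return ClinicalRecord.model_validate_json(raw_json)
    except ValidationError:
        return None   # routed to a retry/review queue in the real pipeline

good = '{"patient_id": "ANON-1042", "diagnosis_code": "G71.0", "age": 34, "medications": []}'
bad = '{"patient_id": "John Smith", "diagnosis_code": "G71.0", "age": 34}'
print(validate_llm_output(good) is not None)   # True
print(validate_llm_output(bad))                # None: fails the ID pattern check
```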
Built a cross-lingual information retrieval system translating English queries into Hindi, Gujarati, and Bengali. NLP pipelines for query translation, semantic matching, and cross-language relevance ranking. Genuine research — no existing solution to benchmark against.
Built SQL reporting tables processing 30GB of sales data with automated weekly refresh — standardizing KPIs across 6 Power BI and Tableau dashboards. Refactored join logic, eliminated duplicate-driven overcounting, cut dashboard failures by 40%. Built an internal real-time monitoring tool that improved decision-making efficiency by 35%.
From raw ingestion to deployed AI — and all the infrastructure in between.
Led execution and finance for large-scale cultural events — budgeting, logistics, on-ground coordination. Also served as Head of Photography & Video, managing event coverage teams. Grew from Associate (2021) into dual-director role.
Elected every year of my undergrad. Primary liaison between students and faculty — coordinating academics, advocating for student needs, building trust across a large peer group. Not a one-time win. A four-year record.
Volunteer with Northeastern's South Asian cultural org — event planning, cultural programming, and community building for the South Asian student community in Boston.
Available for full-time Data Engineering, Analytics Engineering, and AI/ML Engineering roles from May 2026. Open to anywhere in the US.
talati.ak@northeastern.edu