I build pipelines that don't break and AI systems that ship. From clinical knowledge graphs to multi-agent platforms in production — I care about what happens after the demo.
I grew up in Gujarat, India, and was class president four years straight at Dharmsinh Desai University. Not because it looked good on paper. Because I'm wired to build structure where there isn't any.
I moved to Boston in 2024 to do my Master's at Northeastern. Within a year I was doing two jobs at once under the same professor: building healthcare AI pipelines in production on GCP, and designing two graduate data engineering courses from scratch for 80+ students.
My philosophy is simple. I build systems that work after the demo ends. Every project has validation, observability, and a clear answer to "what happens when this breaks at 2am?" That's the line between data engineering and data science theater.
I graduate May 2026. I'm in an intensive job search right now — looking for a team that takes data infrastructure or applied AI seriously. If that's you, let's talk.
Just completed — a clinical AI reasoning engine for rare disease differential diagnosis. Not a chatbot. A diagnostic workflow engine.
A symptom-to-rare-disease reasoning engine. Enter symptoms, traverse the disease knowledge graph, surface Orpha-coded candidates with targeted follow-up questions.
Snowflake-native analytics with KG_NODES / KG_EDGES / KG_BUILD_META tables, Snowflake Cortex LLM functions, dbt marts, and an agentic LangGraph assessment path. Encodes domain expertise over 8,700+ HPO ontology terms across 9 authoritative medical sources — Orphanet, PubMed, PMC, OpenFDA, RxNorm, WHO, NCBI Bookshelf, OpenStax, and an inverted symptom index.
Three-tier evidence retrieval: local warehouse (ChromaDB semantic + Snowflake ILIKE) → live APIs (NCBI E-utilities, Europe PMC) → constrained web fallback. Evidence-quality gate refuses to fabricate and returns insufficient_evidence + follow-up questions when the signal is weak. Streamlit app, FastAPI backend, Airflow DAGs, Prometheus metrics, MLflow tracking, optional self-critique pass.
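Here's the shape of that fallback and gate, as a minimal sketch. The tier functions, thresholds, and follow-up wording are illustrative stand-ins, not the production code:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evidence:
    source: str
    text: str
    score: float  # retrieval confidence in [0, 1]

MIN_SCORE, MIN_HITS = 0.6, 3  # assumed gate thresholds

def retrieve(symptoms: list[str],
             tiers: list[Callable[[list[str]], list[Evidence]]]) -> dict:
    """Walk the tiers in order; answer only when a tier returns enough strong
    evidence, otherwise refuse and ask follow-up questions instead."""
    for tier in tiers:
        hits = [e for e in tier(symptoms) if e.score >= MIN_SCORE]
        if len(hits) >= MIN_HITS:
            return {"status": "ok", "evidence": hits}
    return {"status": "insufficient_evidence",
            "follow_up_questions": [f"How long have you had {s}?" for s in symptoms]}

# Stub tiers standing in for ChromaDB/Snowflake, NCBI/Europe PMC, and the web fallback.
def local_warehouse(symptoms): return [Evidence("orphanet", "excerpt", 0.9)]
def live_apis(symptoms): return []
def web_fallback(symptoms): return []

print(retrieve(["ptosis", "fatigable weakness"], [local_warehouse, live_apis, web_fallback]))
```

The gate is the point: one weak hit from the warehouse is not enough to assert a diagnosis, so the engine asks for more signal instead of guessing.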
Filter by track or scroll for everything. The ones on the left are bigger for a reason.
Agentic LLM platform for private-equity-grade due diligence on the Forbes AI 50.
Airflow DAGs ingest 500+ documents per run into GCS, chunked at 800 tokens / 100 overlap into 26,985 vectors (384-dim, all-MiniLM-L6-v2) in Qdrant. A 5-node LangGraph workflow — Planner → Data Generator → Evaluator → Risk Detector → HITL — uses the ReAct pattern with full execution traces. An MCP server exposes the intelligence as tools. Side-by-side RAG vs Structured (Pydantic) pipelines scored head-to-head at 8.17/10.
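The ingest step boils down to this (a sketch, not the DAG itself; the collection name, file path, and Qdrant endpoint are placeholders, and the word-window chunker approximates the real token-based 800/100 split):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

CHUNK, OVERLAP = 800, 100

def chunk_text(text: str) -> list[str]:
    # Sliding word window; the pipeline proper counts tokens, not words.
    words = text.split()
    step = CHUNK - OVERLAP
    return [" ".join(words[i:i + CHUNK]) for i in range(0, len(words), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dim embeddings
client = QdrantClient(url="http://localhost:6333")     # placeholder endpoint

client.recreate_collection(
    collection_name="ai50_docs",                       # placeholder name
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

chunks = chunk_text(open("filing.txt").read())         # placeholder document
client.upsert(
    collection_name="ai50_docs",
    points=[PointStruct(id=i, vector=vec.tolist(), payload={"text": c})
            for i, (vec, c) in enumerate(zip(model.encode(chunks), chunks))],
)
```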
Multi-agent intelligence for technical startup founders.
Seven specialized agents (Pitch, Competitive, Marketing, Patent, Policy, Team + Coordinator) on a shared LangGraph state. Weaviate Cloud with HNSW indexing. GPT-4 ↔ Gemini failover. MCP tools: DuckDuckGo, Pexels, USPTO, web extraction. Pitch agent integrates Gamma.ai across 5 themes. React + TypeScript + Vite frontend, FastAPI backend.
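The shared-state pattern, reduced to three of the nodes. Everything below the imports is a stub: the real agents call tools and models, and routing is the Coordinator's decision rather than a fixed edge:

```python
from typing import TypedDict
from langgraph.graph import END, StateGraph

class FounderState(TypedDict):
    idea: str
    pitch: str
    competitive: str

def coordinator(state: FounderState) -> dict:
    return {"idea": state["idea"].strip()}      # real version decides which agents run

def pitch_agent(state: FounderState) -> dict:
    return {"pitch": f"Pitch outline for: {state['idea']}"}

def competitive_agent(state: FounderState) -> dict:
    return {"competitive": f"Competitor scan for: {state['idea']}"}

graph = StateGraph(FounderState)
graph.add_node("coordinator", coordinator)
graph.add_node("pitch", pitch_agent)
graph.add_node("competitive", competitive_agent)
graph.set_entry_point("coordinator")
graph.add_edge("coordinator", "pitch")
graph.add_edge("pitch", "competitive")
graph.add_edge("competitive", END)

print(graph.compile().invoke({"idea": "AI due diligence copilot", "pitch": "", "competitive": ""}))
```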
Cloud-native RAG over a 4,000-page Financial Toolbox.
Code-aware chunking with custom separator priority (1200-char chunks, 200 overlap) — 180/180 MATLAB code blocks preserved intact, 100% metadata preservation for citations. text-embedding-3-large at 3072 dimensions, Instructor-validated Pydantic outputs, Wikipedia fallback when concepts aren't in the PDF. Cloud Composer weekly refresh at $0.0024 per query.
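The splitter itself is small; the trick is the separator order. A sketch of the idea with LangChain's recursive splitter, where the separator list and input file are assumptions, not the project's exact configuration:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Try the coarsest boundaries first, so a chunk breaks at a code-fence or
# paragraph boundary before it ever breaks mid-line or mid-code-block.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n```", "\n\n", "\n", " ", ""],   # illustrative priority order
    chunk_size=1200,
    chunk_overlap=200,
    keep_separator=True,
)

text = open("financial_toolbox_excerpt.txt").read()   # placeholder input
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks, longest {max(map(len, chunks))} chars")
```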
End-to-end MLOps for real-time sentiment at production scale.
Complete MLOps lifecycle. Logistic Regression, Random Forest, XGBoost tracked via MLflow — winning model served by FastAPI on Kubernetes. Kafka live streaming, Prometheus + Grafana for latency / throughput / drift, GitHub Actions CI/CD.
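Model selection is the part worth showing. A trimmed sketch: synthetic features stand in for the real text vectorization, XGBoost is omitted to keep it short, and the experiment name and metric are placeholders:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

mlflow.set_experiment("sentiment-models")
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("random_forest", RandomForestClassifier(n_estimators=200))]:
    with mlflow.start_run(run_name=name):
        model.fit(X_tr, y_tr)
        mlflow.log_metric("f1", f1_score(y_te, model.predict(X_te)))
        mlflow.sklearn.log_model(model, "model")

# The best run is what the FastAPI service on Kubernetes actually serves.
best = mlflow.search_runs(order_by=["metrics.f1 DESC"]).iloc[0]
print("winner:", best["tags.mlflow.runName"], round(best["metrics.f1"], 3))
```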
Star schema over 90M rows, 10× query speedup.
Fact_Title_Ratings at proper grain + 5 dimensions + 2 bridge tables. Eliminates row explosion and metric inflation on many-to-many joins. ADF + Alteryx + Snowflake + Power BI.
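The row-explosion problem in one toy example (pandas here just to make the point; the production model lives in Snowflake, and the column names are illustrative):

```python
import pandas as pd

# One title with 1,000 votes, tagged with two genres on the bridge.
fact = pd.DataFrame({"title_key": [1], "num_votes": [1000]})
bridge = pd.DataFrame({"title_key": [1, 1], "genre_key": [10, 20]})

exploded = fact.merge(bridge, on="title_key")   # many-to-many join duplicates the fact row
print(exploded["num_votes"].sum())              # 2000 -> inflated metric
print(fact["num_votes"].sum())                  # 1000 -> correct, measured at fact grain
```

Keep measures at the fact grain and use the bridge only for slicing; that's the whole reason the bridge tables exist.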
First unified Chicago + Dallas health-safety model.
Two cities. Incompatible schemas (17 cols vs 114 cols). Different risk systems — categorical labels vs numeric scores. City-aware Alteryx ETL resolves the semantic mismatch into one trustworthy fact table.
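A toy version of the semantic alignment. The real mapping lives in the city-aware Alteryx flows; the labels, score bins, and column names below are assumptions for illustration:

```python
import pandas as pd

chicago = pd.DataFrame({"name": ["A Cafe"], "risk": ["Risk 1 (High)"]})
dallas = pd.DataFrame({"name": ["B Diner"], "inspection_score": [72]})

LABEL_TO_LEVEL = {"Risk 1 (High)": 3, "Risk 2 (Medium)": 2, "Risk 3 (Low)": 1}

# Categorical labels and numeric scores both land on a shared 1-3 risk level.
chicago["risk_level"] = chicago["risk"].map(LABEL_TO_LEVEL)
dallas["risk_level"] = pd.cut(dallas["inspection_score"],
                              bins=[-1, 69, 89, 100], labels=[3, 2, 1]).astype(int)

fact = pd.concat([chicago.assign(city="Chicago"), dallas.assign(city="Dallas")],
                 ignore_index=True)[["city", "name", "risk_level"]]
print(fact)
```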
5M+ records, MERGE-based incremental refresh.
Snowflake dimensional model. Power BI trends by demographics, offense type, precinct, time. Designed for ongoing policy evaluation, not static reports.
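The refresh pattern, roughly: new batches land in a staging table, then a MERGE updates changed rows and inserts new ones instead of rebuilding the fact table. All table, column, and connection names below are placeholders:

```python
import snowflake.connector

MERGE_FACT = """
MERGE INTO fact_arrests AS t
USING stg_arrests AS s
  ON t.arrest_key = s.arrest_key
WHEN MATCHED AND t.record_hash <> s.record_hash THEN UPDATE SET
  offense_type = s.offense_type,
  precinct     = s.precinct,
  record_hash  = s.record_hash
WHEN NOT MATCHED THEN INSERT (arrest_key, offense_type, precinct, record_hash)
  VALUES (s.arrest_key, s.offense_type, s.precinct, s.record_hash);
"""

conn = snowflake.connector.connect(account="...", user="...", password="...",
                                   warehouse="ANALYTICS_WH", database="CRIME_DB")
conn.cursor().execute(MERGE_FACT)   # only rows that changed or are new get touched
```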
SEC 10-K / 10-Q parsing with dual open-source engines + cloud benchmark.
Two complementary parsers: pdfplumber for speed, Docling for layout-aware extraction with reading-order and bounding-box provenance. Optional Google Document AI benchmark. XBRL cross-checks catch scaling mismatches (e.g., "in millions"). Full DVC reproducibility with metrics tracked.
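The scaling check is simple but catches real errors. A sketch, with made-up numbers and a units list that's an assumption rather than the project's full rule set:

```python
def scale_mismatch(parsed_value: float, units_note: str, xbrl_value: float,
                   tolerance: float = 0.01) -> bool:
    """True when a table value, rescaled per its units footnote, disagrees with
    the corresponding XBRL fact by more than the tolerance."""
    scale = {"in thousands": 1e3, "in millions": 1e6, "in billions": 1e9}
    rescaled = parsed_value * scale.get(units_note.lower(), 1.0)
    return abs(rescaled - xbrl_value) / max(abs(xbrl_value), 1.0) > tolerance

# Revenue parsed from a 10-K table footnoted "in millions" vs the filed XBRL fact.
print(scale_mismatch(1_250, "in millions", 1_250_000_000))   # False: scales agree
print(scale_mismatch(1_250, "in millions", 1_250_000))       # True: off by 1000x
```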
Automated Dow-30 earnings pipeline with 3-tier Selenium scraping.
100% IR discovery across all 30 Dow Jones companies via hybrid strategy (subdomain patterns, homepage anchors, pattern guessing, DuckDuckGo fallback). Three-tier scraping: Requests → Selenium navigation → aggressive DOM manipulation. Docling parses top pages; Instructor + GPT-4 extracts 26 metadata fields per doc. Airflow to GCS.
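The first two tiers, in miniature. The URL is a placeholder, the "real page" heuristic is an assumption, and the third tier's DOM manipulation is left out:

```python
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch(url: str) -> str:
    # Tier 1: plain HTTP for static IR pages.
    try:
        resp = requests.get(url, timeout=15, headers={"User-Agent": "Mozilla/5.0"})
        if resp.ok and len(resp.text) > 2000:   # crude check that we got a real page
            return resp.text
    except requests.RequestException:
        pass
    # Tier 2: headless Selenium for JavaScript-rendered investor-relations sites.
    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

html = fetch("https://investor.example.com/quarterly-results")
```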
Four roles. Two countries. One throughline — systems that work at scale.
Research: Building production GCP pipelines (Cloud Composer, BigQuery, PySpark) processing 867 clinical records + 1,200+ EMR cases. Multi-stage LLM extraction with schema validation and PII controls — 97% structured output compliance, 85% error reduction. (Validation gate sketched below.)
Teaching: Designed Database Management (SQL/Oracle) and Data Integration courses from scratch — curriculum, labs, assignments — then delivered both to 80+ graduate students. Both roles under the same professor.
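On the research side, the gate between the LLM extraction stage and the warehouse looks roughly like this. The schema is a cut-down stand-in (pydantic v2), not the pipeline's real field set:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class ClinicalRecord(BaseModel):
    patient_id: str = Field(pattern=r"^ANON-\d+$")   # pseudonymized IDs only, no PII
    diagnosis_code: str
    age: int = Field(ge=0, le=120)
    medications: list[str] = []

def validate_llm_output(raw_json: str) -> Optional[ClinicalRecord]:
    """Reject anything off-schema before it reaches BigQuery."""
    try:
        return ClinicalRecord.model_validate_json(raw_json)
    except ValidationError:
        return None   # routed to a retry/review queue in the real pipeline

good = '{"patient_id": "ANON-1042", "diagnosis_code": "G71.0", "age": 34, "medications": []}'
bad = '{"patient_id": "John Smith", "diagnosis_code": "G71.0", "age": 34}'
print(validate_llm_output(good) is not None)   # True
print(validate_llm_output(bad))                # None: fails the ID pattern check
```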
Built a cross-lingual information retrieval system translating English queries into Hindi, Gujarati, and Bengali. NLP pipelines for query translation, semantic matching, and cross-language relevance ranking. Genuine research — no existing solution to benchmark against.
Built SQL reporting tables processing 30GB of sales data with automated weekly refresh — standardizing KPIs across 6 Power BI and Tableau dashboards. Refactored join logic, eliminated duplicate-driven overcounting, cut dashboard failures by 40%. Built an internal real-time monitoring tool that improved decision-making efficiency by 35%.
From raw ingestion to deployed AI — and all the infrastructure in between.
Led execution and finance for large-scale cultural events — budgeting, logistics, on-ground coordination. Also served as Head of Photography & Video, managing event coverage teams. Grew from Associate (2021) into dual-director role.
Elected every year of my undergrad. Primary liaison between students and faculty — coordinating academics, advocating for student needs, building trust across a large peer group. Not a one-time win. A four-year record.
Volunteer with Northeastern's South Asian cultural org — event planning, cultural programming, and community building for the South Asian student community in Boston.
Available for full-time Data Engineering, Analytics Engineering, and AI/ML Engineering roles from May 2026. Open to anywhere in the US.
talati.ak@northeastern.edu