Key Project

AnomalyForge — Real-Time Industrial Anomaly Detection

End-to-end ML-powered anomaly detection platform for industrial IoT data streams. Designed as a production-ready system with clear separation between ML training, model serving, backend API, and observability layers.

PyTorch FastAPI PostgreSQL Redis Docker Kubernetes Terraform AWS MLflow Prometheus Grafana GitHub Actions

System Architecture

Four-layer architecture:

LAYER 1 — DATA & TRAINING PIPELINE
Synthetic SCADA Data Generator → Feature Engineering → PyTorch TFT Training → MLflow Registry
Alongside: Optuna Hyperparameter Tuning · DVC Artefact Versioning

LAYER 2 — BACKEND API (SERVING)
Nginx Load Balancer → FastAPI Instance 1 / FastAPI Instance 2
Backed by: Redis Feature Cache · PostgreSQL Metadata Store
Endpoints: POST /predict (<50ms p99) · WS /alerts (real-time) · GET /health (liveness)

LAYER 3 — INFRASTRUCTURE & CI/CD
Docker / K8s (container orchestration) · Terraform (Infrastructure as Code) · AWS (ECS · ECR · RDS · ElastiCache)
GitHub Actions CI/CD: lint + test → model validation gate → container build + scan → blue-green deploy

LAYER 4 — OBSERVABILITY & RETRAINING
Prometheus Metrics Scraper → Grafana Dashboards (inference latency · throughput · model drift) → Alert Manager (threshold routing) → Airflow DAG (auto-retrain on drift, human-in-loop gate → registry promotion)

Component Reference

Synthetic SCADA Data Generator

Generates realistic multi-sensor time-series data modelled on medium-voltage substation telemetry (voltage, current, transformer temperature, active/reactive power). Produces configurable anomaly patterns: gradual degradation, sudden spikes, sensor stuck-at-zero, and missing-data bursts. Designed as a pluggable data source with an adapter interface for real SCADA feeds.

Tech: Python, NumPy, Pandas
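A minimal sketch of what such a generator might look like. The project lists NumPy/Pandas; plain `math`/`random` keeps the example self-contained, and the sensor ranges, anomaly patterns, and function name are illustrative assumptions, not the project's actual code.

```python
import math
import random

def generate_telemetry(n_steps, anomaly_rate=0.01, seed=42):
    """Generate synthetic substation telemetry with injected anomalies.

    Returns a list of (voltage, current, temperature, label) tuples,
    where label is 1 for an injected anomaly and 0 otherwise.
    """
    rng = random.Random(seed)
    rows = []
    for t in range(n_steps):
        # Nominal 11 kV bus voltage with slow diurnal drift plus sensor noise.
        voltage = 11_000 + 150 * math.sin(2 * math.pi * t / 1440) + rng.gauss(0, 20)
        current = 80 + 10 * math.sin(2 * math.pi * t / 720) + rng.gauss(0, 2)
        # Transformer temperature loosely tracks load (current).
        temperature = 45 + 0.2 * current + rng.gauss(0, 0.5)
        label = 0
        if rng.random() < anomaly_rate:
            # Two of the configurable patterns: stuck-at-zero or a sudden spike.
            voltage *= rng.choice([0.0, 1.3])
            label = 1
        rows.append((voltage, current, temperature, label))
    return rows
```

The adapter interface mentioned above would let a real SCADA feed replace this function behind the same signature.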

Feature Engineering

Transforms raw sensor streams into ML-ready features: temporal aggregations (rolling means, standard deviations), cross-sensor correlation features, time-of-day/day-of-week cyclical encodings, and lag features. Implements automated data-quality validation with acceptance criteria gating before downstream consumption.

Tech: Pandas, scikit-learn pipelines
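A sketch of the transformations described above, assuming a DatetimeIndex and numeric sensor columns; the column names, window size, and lag choices are illustrative.

```python
import numpy as np
import pandas as pd

def build_features(df: pd.DataFrame, lags=(1, 3), window=12) -> pd.DataFrame:
    """Derive ML-ready features from a raw sensor frame."""
    out = df.copy()
    for col in df.columns:
        # Temporal aggregations over a rolling window.
        out[f"{col}_roll_mean"] = df[col].rolling(window).mean()
        out[f"{col}_roll_std"] = df[col].rolling(window).std()
        # Lag features for short-horizon context.
        for lag in lags:
            out[f"{col}_lag{lag}"] = df[col].shift(lag)
    # Cyclical time-of-day encoding avoids a discontinuity at midnight.
    minutes = out.index.hour * 60 + out.index.minute
    out["tod_sin"] = np.sin(2 * np.pi * minutes / 1440)
    out["tod_cos"] = np.cos(2 * np.pi * minutes / 1440)
    # Drop warm-up rows that lack full rolling/lag history.
    return out.dropna()
```

In the described pipeline a data-quality validation step would gate the output of this function before downstream consumption.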

PyTorch TFT Training

Trains a Temporal Fusion Transformer (TFT) for multi-horizon anomaly detection on multivariate time-series. Uses attention-based variable selection to identify which sensors drive each prediction, providing built-in interpretability for operations teams.

Tech: PyTorch, PyTorch Forecasting

Optuna Hyperparameter Tuning

Automates hyperparameter search (learning rate, hidden size, attention heads, dropout) using Bayesian optimisation with early pruning. Reduces manual tuning effort and ensures reproducible, auditable search runs.

Tech: Optuna, MLflow integration
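An objective function in the shape Optuna expects might look like the sketch below. The search ranges are plausible defaults, not the project's actual configuration, and the "validation loss" is a placeholder standing in for a full train-then-evaluate cycle.

```python
def objective(trial):
    """Optuna-style objective: sample TFT hyperparameters, return a score.

    `trial` only needs suggest_float / suggest_int / suggest_categorical,
    so it can be an optuna.Trial or a lightweight test double.
    """
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    hidden = trial.suggest_int("hidden_size", 16, 128)
    heads = trial.suggest_categorical("attention_heads", [1, 2, 4])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    # Placeholder loss surface; the real project trains and validates here.
    return (lr - 1e-2) ** 2 + abs(hidden - 64) / 64 + dropout / heads

def run_study(n_trials=50):
    """Wire the objective into Optuna (requires `pip install optuna`)."""
    import optuna
    # MedianPruner provides the early pruning mentioned above.
    study = optuna.create_study(direction="minimize",
                                pruner=optuna.pruners.MedianPruner())
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

Logging each trial's parameters and score to MLflow (as the Tech line suggests) keeps the search runs auditable.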

MLflow Model Registry

Tracks all training experiments (parameters, metrics, artefacts) and manages model lifecycle stages (Staging → Production → Archived). Serves as the single source of truth for which model version is deployed where.

Tech: MLflow Tracking, MLflow Registry
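A gated promotion step against the registry might be sketched as below. The function takes the client as a parameter so it works with an `mlflow.tracking.MlflowClient` or a test double; the model name, metric, and thresholds are illustrative assumptions.

```python
def promote_if_better(client, model_name, candidate_version, candidate_auc,
                      production_auc, min_auc=0.90):
    """Promote a Staging model version to Production if it clears the gates.

    `client` must expose transition_model_version_stage (the MlflowClient
    method for moving versions between lifecycle stages).
    """
    if candidate_auc < min_auc or candidate_auc <= production_auc:
        return False
    # Archive the current Production version while promoting the candidate.
    client.transition_model_version_stage(
        name=model_name,
        version=candidate_version,
        stage="Production",
        archive_existing_versions=True,
    )
    return True
```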

DVC Artefact Versioning

Versions training datasets, feature sets, and model binaries alongside Git commits, ensuring full reproducibility of any historical training run. Stores large artefacts in remote storage (S3) while keeping lightweight pointers in Git.

Tech: DVC, AWS S3
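The described Git-plus-S3 workflow follows DVC's standard commands; the bucket name and file paths below are illustrative.

```shell
dvc init                                        # once per repository
dvc remote add -d artefacts s3://my-bucket/anomalyforge   # bucket illustrative
dvc add data/train.parquet models/tft.pt        # track large artefacts
git add data/train.parquet.dvc models/tft.pt.dvc .gitignore
git commit -m "Version dataset and model with DVC"
dvc push                                        # upload artefacts to S3
```

`dvc add` writes the lightweight `.dvc` pointer files that live in Git, while `dvc push` moves the binaries themselves to the remote.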

Nginx Load Balancer

Distributes incoming inference requests across multiple FastAPI instances using round-robin with health-check awareness. Handles TLS termination and rate limiting at the edge.

Tech: Nginx, TLS
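A configuration fragment along these lines would implement the described behaviour; upstream hostnames and certificate paths are illustrative. Note that open-source Nginx provides passive health checks (`max_fails`/`fail_timeout`); active health probes require NGINX Plus.

```nginx
# Round-robin upstream with passive health checks.
upstream fastapi_pool {
    server api-1:8000 max_fails=3 fail_timeout=10s;
    server api-2:8000 max_fails=3 fail_timeout=10s;
}

# Shared-memory zone for per-client rate limiting at the edge.
limit_req_zone $binary_remote_addr zone=predict_rl:10m rate=100r/s;

server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/certs/server.crt;   # paths illustrative
    ssl_certificate_key /etc/nginx/certs/server.key;

    location / {
        limit_req zone=predict_rl burst=20 nodelay;
        proxy_pass http://fastapi_pool;
        proxy_set_header Host $host;
        # WebSocket upgrade needed for the /alerts stream.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```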

FastAPI Inference Service

Async Python API serving real-time anomaly predictions with sub-50ms p99 latency. Exposes three core endpoints: POST /predict for batch/single inference, WebSocket /alerts for real-time anomaly streaming, and GET /health for orchestrator probes. Implements request validation with Pydantic models.

Tech: FastAPI, Uvicorn, Pydantic

Redis Feature Cache

Caches pre-computed feature vectors and recent predictions to avoid redundant computation on repeated or overlapping time windows. Reduces p99 latency by eliminating database round-trips for hot-path requests.

Tech: Redis, aioredis
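The cache-aside pattern described above can be sketched as follows. An in-memory class stands in for the Redis client here so the example is self-contained; in deployment it would be replaced by a real async Redis client with the same `get`/`setex` surface.

```python
import hashlib
import json
import time

class InMemoryCache:
    """Minimal stand-in for a Redis client (get / setex with TTL)."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        value, expires_at = self._store.get(key, (None, 0.0))
        return value if time.monotonic() < expires_at else None
    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.monotonic() + ttl_seconds)

def cached_features(cache, window, compute_fn, ttl_seconds=60):
    """Cache-aside lookup keyed on a hash of the raw window.

    Repeated or overlapping requests for the same window skip compute_fn
    (the expensive feature computation / database round-trip).
    """
    key = "feat:" + hashlib.sha256(
        json.dumps(window, sort_keys=True).encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    features = compute_fn(window)
    cache.setex(key, ttl_seconds, json.dumps(features))
    return features
```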

PostgreSQL Metadata Store

Persists inference audit logs, prediction history, alert records, and model-version metadata. Provides the query layer for retrospective analysis and compliance reporting.

Tech: PostgreSQL, asyncpg, SQLAlchemy
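An audit-log table for this purpose might look like the fragment below; column names and indexes are illustrative assumptions, not the project's actual schema.

```sql
-- Illustrative inference audit-log schema.
CREATE TABLE inference_log (
    id              BIGSERIAL PRIMARY KEY,
    requested_at    TIMESTAMPTZ      NOT NULL DEFAULT now(),
    model_version   TEXT             NOT NULL,
    window_id       TEXT             NOT NULL,
    anomaly_score   DOUBLE PRECISION NOT NULL,
    is_anomaly      BOOLEAN          NOT NULL,
    latency_ms      DOUBLE PRECISION
);

-- Retrospective analysis queries typically filter by time and model version.
CREATE INDEX idx_inference_log_time  ON inference_log (requested_at);
CREATE INDEX idx_inference_log_model ON inference_log (model_version, requested_at);
```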

Docker / Kubernetes

Containerises every service (API, cache, database, monitoring) with multi-stage Docker builds. Kubernetes manifests define deployments, services, horizontal pod autoscalers, and readiness/liveness probes for production orchestration.

Tech: Docker, Docker Compose, Kubernetes
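A multi-stage build for the API image might follow this shape; the module path and port are assumptions.

```dockerfile
# Stage 1: build wheels in a full build environment.
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt

# Stage 2: slim runtime image containing only the built wheels.
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY app/ app/
EXPOSE 8000
# Uvicorn serves the FastAPI app; the module path is an assumption.
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Keeping the build toolchain out of the final stage shrinks the image and its attack surface, which also speeds up the CI security scan.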

Terraform IaC

Provisions all AWS infrastructure (ECS clusters, ECR repositories, RDS instances, ElastiCache, ALB, VPC networking) as declarative, version-controlled modules. Enables reproducible environment creation and teardown.

Tech: Terraform, AWS Provider
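A fragment of such a module might look like the sketch below; resource names, instance sizes, and variables are illustrative.

```hcl
# Illustrative fragment; names and sizes are assumptions.
resource "aws_ecr_repository" "api" {
  name                 = "anomalyforge-api"
  image_tag_mutability = "IMMUTABLE"
}

resource "aws_ecs_cluster" "main" {
  name = "anomalyforge"
}

resource "aws_db_instance" "metadata" {
  identifier          = "anomalyforge-metadata"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = var.db_username
  password            = var.db_password
  skip_final_snapshot = true
}
```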

GitHub Actions CI/CD

Orchestrates the full delivery pipeline: lint and unit tests → model validation gate (AUC threshold check) → Docker image build and security scan → blue-green deployment to staging → promotion to production with automated rollback on health-check failure.

Tech: GitHub Actions, Trivy, pytest
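A workflow skeleton for that pipeline might look as follows; the job names, script paths, and AUC threshold are illustrative assumptions.

```yaml
# Illustrative CI/CD workflow skeleton.
name: ci-cd
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r requirements.txt
      - run: ruff check . && pytest

  validate-model:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Fails the pipeline if the candidate model's AUC is below threshold.
      - run: python scripts/check_auc.py --min-auc 0.90

  build-and-deploy:
    needs: validate-model
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t anomalyforge-api .
      # Trivy scans the image for known CVEs before it can ship.
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: anomalyforge-api
      - run: ./scripts/blue_green_deploy.sh   # deployment script is an assumption
```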

Prometheus Metrics

Scrapes inference latency (p50/p95/p99), request throughput, error rates, and custom model-drift metrics from FastAPI's /metrics endpoint. Stores time-series data for alerting and dashboarding.

Tech: Prometheus, prometheus-fastapi-instrumentator
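On the Prometheus side, a scrape fragment like the one below would collect those metrics (target addresses illustrative); on the application side, `prometheus-fastapi-instrumentator` exposes the `/metrics` endpoint via `Instrumentator().instrument(app).expose(app)`.

```yaml
# prometheus.yml scrape fragment; targets are illustrative.
scrape_configs:
  - job_name: anomalyforge-api
    scrape_interval: 5s
    metrics_path: /metrics
    static_configs:
      - targets: ["api-1:8000", "api-2:8000"]
```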

Grafana Dashboards

Visualises operational health across three dashboard panels: inference performance (latency histograms, throughput), model quality (drift scores, prediction distribution shifts), and infrastructure utilisation (CPU, memory, pod count). Pre-built JSON dashboard definitions are version-controlled in the repository.

Tech: Grafana, JSON dashboard provisioning
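Grafana's file-based provisioning loads those version-controlled JSON definitions at startup; a provider entry along these lines would do it (file path illustrative).

```yaml
# grafana/provisioning/dashboards/anomalyforge.yml
apiVersion: 1
providers:
  - name: anomalyforge
    type: file
    options:
      path: /var/lib/grafana/dashboards   # version-controlled JSON lives here
```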

Airflow Retraining DAG

Triggers automated model retraining when drift alerts exceed defined thresholds. Orchestrates the full retraining cycle: fresh data extraction → feature engineering → training → validation → registry promotion → deployment. Includes human-in-the-loop approval gate for production promotion.

Tech: Apache Airflow, DAG definitions
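The gating logic at the heart of that DAG can be sketched independently of Airflow; in the DAG it would sit behind a sensor or branch operator. The threshold, patience, and function names below are illustrative assumptions.

```python
def should_retrain(drift_score, threshold=0.3, consecutive_breaches=0, patience=3):
    """Decide whether the retraining DAG should fire.

    Requires `patience` consecutive drift breaches so a single noisy
    reading does not trigger a full retraining cycle.
    Returns (fire, updated_breach_count).
    """
    if drift_score <= threshold:
        return False, 0
    breaches = consecutive_breaches + 1
    return breaches >= patience, breaches

def promotion_gate(retrain_succeeded, human_approved):
    """Human-in-the-loop gate: the retrained model reaches Production only
    when validation passed AND an operator explicitly approved promotion."""
    return retrain_succeeded and human_approved
```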