Key Project
AnomalyForge — Real-Time Industrial Anomaly Detection
End-to-end ML-powered anomaly detection platform for industrial IoT data streams. Designed as a production-ready system with clear separation between ML training, model serving, backend API, and observability layers.
System Architecture
Four-layer architecture: ML training, model serving, backend API, and observability.
Component Reference
Synthetic SCADA Data Generator
Generates realistic multi-sensor time-series data modelled on medium-voltage substation telemetry (voltage, current, transformer temperature, active/reactive power). Produces configurable anomaly patterns: gradual degradation, sudden spikes, stuck-at-zero sensor faults, and missing-data bursts. Designed as a pluggable data source with an adapter interface for real SCADA feeds.
Tech: Python, NumPy, Pandas
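As a hedged sketch of the generator's shape (sensor names, baselines, and the degradation slope are illustrative, not the project's actual values), a seeded stream with one injected anomaly pattern might look like:

```python
import math
import random

def substation_stream(n_steps, seed=0, anomaly_start=None):
    """Yield synthetic substation readings; from anomaly_start onward,
    inject gradual transformer-temperature degradation (one of the
    configurable anomaly patterns described above)."""
    rng = random.Random(seed)  # seeded for reproducible runs
    for t in range(n_steps):
        drift = 0.0
        if anomaly_start is not None and t >= anomaly_start:
            drift = 2.0 * (t - anomaly_start)  # slow upward drift
        yield {
            "t": t,
            # daily load cycle (288 five-minute steps per day) plus noise
            "voltage_kv": 11.0 + 0.1 * math.sin(2 * math.pi * t / 288) + rng.gauss(0, 0.02),
            "current_a": 150.0 + rng.gauss(0, 5.0),
            "transformer_temp_c": 65.0 + drift + rng.gauss(0, 0.5),
        }
```

An adapter for a real SCADA feed would expose the same per-step record interface, which is what makes the source pluggable.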
Feature Engineering
Transforms raw sensor streams into ML-ready features: temporal aggregations (rolling means, standard deviations), cross-sensor correlation features, time-of-day/day-of-week cyclical encodings, and lag features. Implements automated data-quality validation with acceptance criteria gating before downstream consumption.
Tech: Pandas, scikit-learn pipelines
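A minimal sketch of the rolling and cyclical features, assuming a DatetimeIndex and a single current_a column (column and window names are illustrative, not the project's actual feature set):

```python
import numpy as np
import pandas as pd

def make_features(df, window=12):
    """Rolling aggregations, cyclical time-of-day encodings, and a lag
    feature for one sensor column. min_periods=1 keeps early rows usable."""
    out = pd.DataFrame(index=df.index)
    out["current_mean"] = df["current_a"].rolling(window, min_periods=1).mean()
    out["current_std"] = df["current_a"].rolling(window, min_periods=1).std()
    # sin/cos pair encodes hour-of-day without a midnight discontinuity
    hour = df.index.hour
    out["hod_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hod_cos"] = np.cos(2 * np.pi * hour / 24)
    out["current_lag1"] = df["current_a"].shift(1)
    return out
```

In the full pipeline these transforms would sit inside scikit-learn pipeline steps so training and serving share one code path.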
PyTorch TFT Training
Trains a Temporal Fusion Transformer (TFT) for multi-horizon anomaly detection on multivariate time-series. Uses attention-based variable selection to identify which sensors drive each prediction, providing built-in interpretability for operations teams.
Tech: PyTorch, PyTorch Forecasting
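The variable-selection network inside the TFT is learned end to end, but its interpretability effect can be illustrated with a toy softmax over per-sensor relevance scores: the normalised weights are what an operations team reads as "which sensors drove this prediction". This is a conceptual sketch, not the model itself; in the real TFT the scores come from the learned network:

```python
import math

def variable_selection_weights(scores):
    """Softmax-normalise per-sensor relevance scores into weights that
    sum to 1 (numerically stabilised by subtracting the max score)."""
    m = max(scores.values())
    exp = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exp.values())
    return {k: v / z for k, v in exp.items()}
```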
Optuna Hyperparameter Tuning
Automates hyperparameter search (learning rate, hidden size, attention heads, dropout) using Bayesian optimisation with early pruning. Reduces manual tuning effort and ensures reproducible, auditable search runs.
Tech: Optuna, MLflow integration
MLflow Model Registry
Tracks all training experiments (parameters, metrics, artefacts) and manages model lifecycle stages (Staging → Production → Archived). Serves as the single source of truth for which model version is deployed where.
Tech: MLflow Tracking, MLflow Registry
DVC Artefact Versioning
Versions training datasets, feature sets, and model binaries alongside Git commits, ensuring full reproducibility of any historical training run. Stores large artefacts in remote storage (S3) while keeping lightweight pointers in Git.
Tech: DVC, AWS S3
Nginx Load Balancer
Distributes incoming inference requests across multiple FastAPI instances using round-robin with health-check awareness. Handles TLS termination and rate limiting at the edge.
Tech: Nginx, TLS
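An illustrative nginx fragment for the behaviours described above; hostnames, ports, certificate paths, and rate limits are placeholders, not the deployed configuration:

```nginx
# http-context fragment: rate limiting, round-robin upstream, TLS termination
limit_req_zone $binary_remote_addr zone=api_rl:10m rate=100r/s;

upstream inference_api {
    # round-robin is Nginx's default; max_fails/fail_timeout give
    # passive health-check awareness
    server api-1:8000 max_fails=3 fail_timeout=10s;
    server api-2:8000 max_fails=3 fail_timeout=10s;
}

server {
    listen 443 ssl;
    ssl_certificate     /etc/nginx/tls/cert.pem;
    ssl_certificate_key /etc/nginx/tls/key.pem;

    location / {
        limit_req zone=api_rl burst=20 nodelay;
        proxy_pass http://inference_api;
    }
}
```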
FastAPI Inference Service
Async Python API serving real-time anomaly predictions with sub-50ms p99 latency. Exposes three core endpoints: POST /predict for batch/single inference, WebSocket /alerts for real-time anomaly streaming, and GET /health for orchestrator probes. Implements request validation with Pydantic models.
Tech: FastAPI, Uvicorn, Pydantic
Redis Feature Cache
Caches pre-computed feature vectors and recent predictions to avoid redundant computation on repeated or overlapping time windows. Reduces p99 latency by eliminating database round-trips for hot-path requests.
Tech: Redis, aioredis
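The get-or-compute pattern behind the cache can be sketched as follows; a plain dict stands in for Redis here, and the key scheme is an assumption. In the service the same keys would go through an async Redis client with a TTL:

```python
import hashlib
import json

class FeatureCache:
    """Caches computed feature vectors keyed by sensor and time window,
    so overlapping or repeated windows skip recomputation."""
    def __init__(self):
        self._store = {}  # dict stand-in for Redis

    @staticmethod
    def key(sensor_id, window_start, window_end):
        # Stable, bounded-length cache key for a (sensor, window) pair
        raw = json.dumps([sensor_id, window_start, window_end])
        return "feat:" + hashlib.sha256(raw.encode()).hexdigest()[:16]

    def get_or_compute(self, sensor_id, window_start, window_end, compute):
        k = self.key(sensor_id, window_start, window_end)
        if k not in self._store:
            self._store[k] = compute()  # hot path skips this on repeats
        return self._store[k]
```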
PostgreSQL Metadata Store
Persists inference audit logs, prediction history, alert records, and model-version metadata. Provides the query layer for retrospective analysis and compliance reporting.
Tech: PostgreSQL, asyncpg, SQLAlchemy
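Illustrative DDL for the inference audit log; table and column names are assumptions about the schema, not the project's actual definitions:

```sql
-- One row per served prediction, queryable for retrospective analysis
-- and compliance reporting
CREATE TABLE inference_log (
    id            BIGSERIAL PRIMARY KEY,
    requested_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
    sensor_id     TEXT NOT NULL,
    model_version TEXT NOT NULL,
    anomaly_score DOUBLE PRECISION NOT NULL,
    is_anomaly    BOOLEAN NOT NULL
);
```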
Docker / Kubernetes
Containerises every service (API, cache, database, monitoring) with multi-stage Docker builds. Kubernetes manifests define deployments, services, horizontal pod autoscalers, and readiness/liveness probes for production orchestration.
Tech: Docker, Docker Compose, Kubernetes
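A probe fragment from a hypothetical Deployment spec (port and timings are placeholders); the /health path matches the FastAPI endpoint described above:

```yaml
# container spec fragment: orchestrator probes against GET /health
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5
```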
Terraform IaC
Provisions all AWS infrastructure (ECS clusters, ECR repositories, RDS instances, ElastiCache, ALB, VPC networking) as declarative, version-controlled modules. Enables reproducible environment creation and teardown.
Tech: Terraform, AWS Provider
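A small illustrative resource fragment; names and sizing values are placeholders, not the project's actual modules:

```hcl
# Container registry for the inference image
resource "aws_ecr_repository" "inference_api" {
  name = "anomalyforge/inference-api"
}

# Redis cache node behind the FastAPI instances
resource "aws_elasticache_cluster" "feature_cache" {
  cluster_id      = "anomalyforge-cache"
  engine          = "redis"
  node_type       = "cache.t3.micro"
  num_cache_nodes = 1
}
```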
GitHub Actions CI/CD
Orchestrates the full delivery pipeline: lint and unit tests → model validation gate (AUC threshold check) → Docker image build and security scan → blue-green deployment to staging → promotion to production with automated rollback on health-check failure.
Tech: GitHub Actions, Trivy, pytest
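A sketch of the early pipeline stages showing how the model-validation gate blocks later jobs; step commands and the AUC-check script path are illustrative:

```yaml
# workflow fragment: tests must pass before the model gate runs
on: push
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest
  model-gate:
    needs: test  # gate ordering via job dependencies
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/validate_auc.py --min-auc 0.90  # hypothetical script
```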
Prometheus Metrics
Scrapes inference latency (p50/p95/p99), request throughput, error rates, and custom model-drift metrics from FastAPI's /metrics endpoint. Stores time-series data for alerting and dashboarding.
Tech: Prometheus, prometheus-fastapi-instrumentator
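For example, the p99 latency panel can be driven by a histogram-quantile query over the exported request-duration buckets (the exact metric name depends on instrumentator configuration):

```promql
# p99 inference latency over a 5-minute window
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```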
Grafana Dashboards
Visualises operational health across three dashboards: inference performance (latency histograms, throughput), model quality (drift scores, prediction distribution shifts), and infrastructure utilisation (CPU, memory, pod count). Pre-built JSON dashboard definitions are version-controlled in the repository.
Tech: Grafana, JSON dashboard provisioning
Airflow Retraining DAG
Triggers automated model retraining when drift alerts exceed defined thresholds. Orchestrates the full retraining cycle: fresh data extraction → feature engineering → training → validation → registry promotion → deployment. Includes human-in-the-loop approval gate for production promotion.
Tech: Apache Airflow, DAG definitions