Predictive Maintenance

NeuroGrid Fault Risk Scoring Platform

A production ML system that predicts electrical grid fault risk to prioritise preventive maintenance. Demonstrates end-to-end ML delivery: feature engineering, model training, CI/CD pipelines, versioned artefacts, and API serving.

CI/CD for ML · Feature Engineering · MLflow Tracking · FastAPI Serving · Docker Deployment

What This Project Does

This system predicts the probability of electrical faults on medium-voltage power grid assets over the next 30 days. It produces a risk score and classification (LOW, MEDIUM, HIGH, CRITICAL) to help maintenance teams decide which equipment needs attention first.

The problem it solves: electrical utilities manage thousands of assets (transformers, cables, circuit breakers) and need a data-driven way to prioritise inspections and repairs. Without this, maintenance is reactive—teams respond to failures after they occur, leading to outages, safety issues, and higher costs. This system enables proactive maintenance by identifying high-risk assets before they fail.

What it demonstrates: this is a complete ML system, not just a model. It shows how to take a business problem through the full ML lifecycle—data preparation, feature engineering, model training with experiment tracking (MLflow), automated testing, versioned model artefacts, containerised deployment, and a production API that other systems can integrate with. The live demo below connects to a real deployed service.

Technical Deep Dive

Purpose and Scope

This system implements a supervised classification model for fault risk prediction on medium-voltage electrical grid assets. The model consumes historical fault records, asset metadata, and time-based features to output a probability estimate and risk classification. The scope encompasses the entire ML pipeline: feature engineering, model training, experiment tracking, CI/CD for model deployment, and inference serving via REST API.

Architecture and Design Decisions

The architecture follows a modular pipeline pattern: ingestion → feature engineering → training → evaluation → artefact versioning → serving. Feature engineering transforms raw operational data (fault timestamps, asset attributes) into time-windowed aggregates (faults in last 30/90/180 days) and categorical encodings. The training module uses MLflow for experiment tracking and model registry, enabling reproducible experiments and version-controlled artefacts.

The serving layer separates model execution from the API interface: a FastAPI application loads the serialised model artefact at startup and exposes a /predict endpoint. This design allows model updates without changing API contracts. The system is deployed in Docker containers on Render, with a Cloudflare proxy for same-origin requests from this site.

Data Flow and Execution Model

Training flow: Raw fault logs → time-based feature extraction (days since last fault, rolling counts) → categorical feature encoding (asset type, region, manufacturer) → train/test split using time-based cutoff (prevents data leakage) → model training with cross-validation → MLflow logging (metrics, parameters, artefacts) → model serialisation to .pkl with versioned release tag.

Inference flow: Client sends JSON payload with feature values → FastAPI validates schema (Pydantic models) → model predicts fault probability → post-processing applies risk band thresholds (0-0.3: LOW, 0.3-0.6: MEDIUM, 0.6-0.8: HIGH, 0.8+: CRITICAL; bands are half-open, so a score of exactly 0.3 falls in MEDIUM) → API returns JSON response with probability, risk band, and model metadata.
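The post-processing step can be sketched as a small threshold function. This is a minimal illustration using the thresholds stated above, treating each band as half-open; the function name is illustrative, not the service's actual API.

```python
def risk_band(probability: float) -> str:
    """Map a fault probability in [0, 1] to a risk band.

    Bands are half-open: a probability of exactly 0.3 is MEDIUM,
    exactly 0.8 is CRITICAL.
    """
    if probability < 0.3:
        return "LOW"
    if probability < 0.6:
        return "MEDIUM"
    if probability < 0.8:
        return "HIGH"
    return "CRITICAL"
```

Keeping the banding separate from the model means thresholds can be retuned (e.g. after stakeholder feedback on false-alarm rates) without retraining.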

Technology Stack and Rationale

Python 3.11: Core language for ML and API. Chosen for mature ML ecosystem (scikit-learn, pandas) and type safety improvements (used with Pydantic for runtime validation).

scikit-learn: Model implementation. Provides production-ready classifiers (RandomForestClassifier used here), pipeline abstractions for feature transformations, and serialisation. Chosen over deep learning frameworks because the problem is tabular with limited data—simpler models generalise better and are easier to interpret.

MLflow: Experiment tracking and model registry. Logs hyperparameters, metrics (precision, recall, F1), and model artefacts for reproducibility. Enables comparison across training runs and promotes models from experimentation to production. Alternative (Weights & Biases) requires external service; MLflow can run self-hosted.

FastAPI: REST API framework. Automatic OpenAPI/Swagger documentation, request validation with Pydantic, async support (though not required here), and fast startup times. Chosen over Flask for better type safety and built-in docs. Chosen over Django for lower overhead—this is an inference service, not a full web application.

Pydantic: Data validation. Enforces schema at API boundary (input features must match expected types and ranges). Catches bad requests before they reach the model, preventing runtime errors and nonsensical predictions.
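A boundary schema of this kind might look as follows. The field names and ranges here are assumptions for illustration, not the service's actual contract; the point is that out-of-range values raise a ValidationError before the model is ever invoked.

```python
from pydantic import BaseModel, Field, ValidationError

class PredictRequest(BaseModel):
    """Hypothetical request schema for the /predict endpoint."""
    asset_type: str
    region: str
    asset_age_years: float = Field(ge=0, le=100)   # bounds are illustrative
    faults_last_90d: int = Field(ge=0)
    days_since_last_fault: int = Field(ge=0)

# A well-formed payload parses into a typed object.
ok = PredictRequest(asset_type="transformer", region="north",
                    asset_age_years=12.5, faults_last_90d=2,
                    days_since_last_fault=40)

# A negative age violates the constraint and is rejected at the boundary.
try:
    PredictRequest(asset_type="transformer", region="north",
                   asset_age_years=-3, faults_last_90d=2,
                   days_since_last_fault=40)
except ValidationError:
    pass  # bad request never reaches the model
```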

Docker: Containerisation. Ensures consistent environment between local development, CI, and production. The Dockerfile uses multi-stage builds to keep image size small (base Python + dependencies layer, then add application code).

GitHub Actions: CI/CD pipeline. Automates model training on push to main, runs tests, builds Docker image, and creates versioned releases with model artefacts. This ensures every model in production is traceable to a specific commit and training run.

Render: Deployment platform. Hosts the Dockerised FastAPI service. Chosen for simplicity (auto-deploy from GitHub) and free tier for demos. Limitation: cold starts on free tier cause 30-60s delay on first request after idle period.

Cloudflare Workers: Proxy layer. Routes requests from this static site to the Render service, avoiding CORS issues. Adds caching and DDoS protection.

Implementation Details

Feature engineering: time-based features use pandas date arithmetic to compute days since last event and rolling window counts. Categorical features are one-hot encoded with sklearn.preprocessing.OneHotEncoder. Missing values are imputed with median (numeric) or mode (categorical) to handle incomplete records.
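The time-based features can be sketched with pandas date arithmetic as described. Column names and dates here are illustrative, not the project's actual schema.

```python
import pandas as pd

# Toy fault log: one row per fault event per asset.
faults = pd.DataFrame({
    "asset_id": ["A1", "A1", "A1", "B2"],
    "fault_date": pd.to_datetime(
        ["2024-01-05", "2024-03-10", "2024-06-01", "2024-05-20"]),
})
as_of = pd.Timestamp("2024-06-30")  # feature computation date

# Days since the most recent fault, per asset.
last_fault = faults.groupby("asset_id")["fault_date"].max()
days_since = (as_of - last_fault).dt.days

# Rolling-window count: faults in the last 90 days, per asset.
recent = faults[faults["fault_date"] >= as_of - pd.Timedelta(days=90)]
faults_90d = recent.groupby("asset_id").size()
```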

Model: RandomForestClassifier with 100 trees, max depth 10, and class weighting to handle imbalanced classes (faults are rare events). Hyperparameters (n_estimators: 50-200, max_depth: 5-15, min_samples_split: 2-10) were tuned via grid search logged in MLflow. The model outputs calibrated probabilities using CalibratedClassifierCV to ensure probability estimates are reliable for decision-making.
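The model configuration described above can be sketched as follows. The real features and labels are not public, so this runs on synthetic imbalanced data; the hyperparameter values match the document.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the fault dataset: ~5% positive class.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

base = RandomForestClassifier(n_estimators=100, max_depth=10,
                              class_weight="balanced", random_state=42)

# Wrap the forest so predict_proba returns calibrated probabilities,
# which matters when downstream thresholds drive maintenance decisions.
model = CalibratedClassifierCV(base, cv=3)
model.fit(X, y)

proba = model.predict_proba(X[:1])[0, 1]  # calibrated fault probability
```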

Testing: unit tests verify feature engineering logic, integration tests check API responses, and smoke tests validate the deployed service. CI runs tests on every commit before building the Docker image.

Versioning: model artefacts are tagged with semantic versions (model-v0.1.0) and stored in GitHub Releases. The API reads the model version from a manifest file packaged in the Docker image, exposing it in the /health endpoint for observability.
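The manifest lookup behind the /health endpoint could be as small as this. The manifest filename and key are assumptions for illustration, not the project's actual layout.

```python
import json
from pathlib import Path

# Hypothetical manifest baked into the Docker image at build time.
MANIFEST = Path("model_manifest.json")
MANIFEST.write_text(json.dumps({"model_version": "model-v0.1.0"}))

def model_version(path: Path = MANIFEST) -> str:
    """Read the deployed model version for the /health endpoint."""
    return json.loads(path.read_text())["model_version"]
```

Reading the version from a file rather than hard-coding it keeps the API code unchanged across model releases.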

Technical Challenges and Trade-offs

Data leakage prevention: Used time-based train/test split instead of random split. Random splitting would leak future information into the training set (e.g. a fault on day 100 could inform features computed for day 99), inflating metrics. Time-based split mirrors real deployment: the model is trained on past data and evaluated on future data.
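The split itself is a single date comparison: everything before the cutoff trains, everything on or after it evaluates. Column names and dates here are illustrative.

```python
import pandas as pd

# Toy labelled dataset with an event date per record.
df = pd.DataFrame({
    "event_date": pd.to_datetime(
        ["2023-11-01", "2023-12-15", "2024-02-01", "2024-03-10"]),
    "label": [0, 1, 0, 1],
})
cutoff = pd.Timestamp("2024-01-01")

# Train strictly on the past, evaluate strictly on the future.
train = df[df["event_date"] < cutoff]
test = df[df["event_date"] >= cutoff]
```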

Class imbalance: Fault events are rare (~5% positive class). Standard classifiers overpredict the majority class. Solution: class weighting (class_weight='balanced') and threshold tuning. Trade-off: higher recall (catch more faults) at the cost of precision (more false alarms). Acceptable because false alarm cost (unnecessary inspection) is lower than miss cost (unplanned outage).
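The recall/precision trade-off from threshold tuning can be shown on synthetic scores: lowering the decision threshold below the default 0.5 catches more true faults at the cost of more false alarms. The labels and scores here are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic ground truth (1 = fault) and model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.7, 0.55, 0.6])

results = {}
for threshold in (0.5, 0.4):
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (recall_score(y_true, y_pred),
                          precision_score(y_true, y_pred))

# Lowering the threshold raises recall and lowers precision:
# the fault scored 0.45 is now caught, but so is a healthy asset at 0.4.
```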

Cold start latency: Render free tier spins down after inactivity. First request after idle takes 30-60s to wake up. Mitigation: health check pre-flight request in the demo form. Alternative (paid tier with always-on instances) eliminates this but adds cost. Acceptable trade-off for a demo project.

Model interpretability vs. performance: Random forests are less interpretable than logistic regression but provide better predictions on this data (non-linear relationships between features). For production deployment, added SHAP value logging (not exposed in this demo) to explain individual predictions. Trade-off: complexity for accuracy, mitigated with post-hoc explainability.

Deployment simplicity vs. scalability: Single-instance deployment on Render. Sufficient for demo load (~10 req/min) but would require horizontal scaling (multiple instances behind load balancer) for production traffic. Future improvement: migrate to Kubernetes or serverless (AWS Lambda) for auto-scaling. Current design prioritises ease of setup over scale.

Live Inference Demo

Score risk via a same-origin Cloudflare proxy (no CORS). If the upstream service has been idle, the first request may take a moment to warm up.
