Open Banking Transaction Enrichment
When you connect your bank account through Open Banking APIs, raw transaction data comes in messy formats like "POS 4839 STARBUCKS COFFEE MILANO 02/11 IT". This system automatically cleans that data, identifies the merchant name ("Starbucks Coffee"), assigns a spending category ("Food"), and flags patterns like recurring subscriptions or unusual transactions.
Banks and fintech apps need to show users clean, categorized transaction histories. Without automation, this requires either manual labeling or rigid rule-based systems that break when merchants use abbreviations, foreign languages, or inconsistent formatting. Simple pattern matching fails on unseen merchant names and produces low-quality results that erode user trust.
This project is a practical application of NLP to real-world financial data processing. It demonstrates text normalization, entity extraction, semantic categorization, and pattern detection—core skills for data engineering and machine learning roles. The implementation handles multi-language input, noise filtering, and confidence scoring, producing structured JSON output suitable for downstream analytics or user interfaces.
The system enriches Open Banking transaction descriptions by extracting structured data from unstructured text. It processes transactions in real time, maintaining latency low enough for user-facing applications. The scope includes merchant name extraction, category classification, signal detection (recurring payments, anomalies), and confidence scoring.
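As a concrete sketch of that output, an enriched record might look like the following. The field names are illustrative, not the library's exact schema:

```javascript
// Hypothetical input/output pair; field names are illustrative.
const raw = "POS 4839 STARBUCKS COFFEE MILANO 02/11 IT";

const enriched = {
  merchant: "Starbucks Coffee", // cleaned, capitalized merchant name
  category: "Food",             // semantic spending category
  signals: [],                  // e.g. "recurring-subscription"
  confidence: 0.85,             // heuristic quality score in [0, 1]
};
```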
Built as a client-side JavaScript library to minimize latency and server costs. The modular pipeline separates concerns: cleanDescription() handles tokenization and stopword removal, extractMerchant() performs entity recognition, categorize() runs semantic matching, and detectSignals() identifies transaction patterns. Each stage is independently testable and can be replaced with more sophisticated models (e.g., transformer-based NER) without rewriting the entire pipeline.
Raw transaction description → Noise removal (POS codes, dates, card numbers) → Tokenization → Stopword filtering (multi-language) → Merchant extraction (capitalization normalization) → Category classification (keyword-based semantic matching with weighted scoring) → Signal detection (pattern matching for subscriptions, anomalies) → Confidence calculation → Structured JSON output.
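The flow above can be sketched as a straight composition of the four named stages. The stage names come from the library; the bodies here are simplified stand-ins to show how the stages fit together, not the real logic:

```javascript
// Sketch of the stage composition. Stage names come from the library;
// the bodies here are simplified stand-ins, not the real implementations.
function cleanDescription(raw) {
  return raw
    .replace(/\bPOS\s*\d+\b/gi, " ")   // strip POS codes
    .replace(/\b\d{2}\/\d{2}\b/g, " ") // strip dates like 02/11
    .trim()
    .split(/\s+/);
}
function extractMerchant(tokens) {
  return tokens.slice(0, 2).join(" "); // first meaningful tokens
}
function categorize(tokens) {
  return tokens.includes("COFFEE") ? "Food" : "Other"; // toy keyword match
}
function detectSignals(tokens) {
  return []; // pattern matching elided in this sketch
}

// The pipeline composes the stages in the order described above:
function enrich(raw) {
  const tokens = cleanDescription(raw);
  return {
    merchant: extractMerchant(tokens),
    category: categorize(tokens),
    signals: detectSignals(tokens),
  };
}

// enrich("POS 4839 STARBUCKS COFFEE 02/11")
//   → { merchant: "STARBUCKS COFFEE", category: "Food", signals: [] }
```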
Merchant extraction uses regex to strip technical noise (POS codes, card references), then applies stopword filtering across five languages (EN, IT, ES, FR, DE). It selects the first 2-3 meaningful tokens as the merchant name, applying capitalization rules for consistent output.
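A minimal sketch of this heuristic, with a tiny illustrative stopword subset standing in for the real multi-language dictionaries:

```javascript
// Sketch of the extraction heuristic. The stopword list is a tiny
// illustrative subset of the real multi-language dictionaries.
const STOPWORDS = new Set([
  "pos", "card", "payment", // EN
  "pagamento",              // IT
  "pago",                   // ES
  "paiement",               // FR
  "zahlung",                // DE
]);

function extractMerchant(raw) {
  const tokens = raw
    .replace(/\b\d{2}\/\d{2}\b/g, " ")        // strip dates
    .replace(/\b(?:pos|card)\s*\d+\b/gi, " ") // strip POS/card references
    .toLowerCase()
    .split(/[^a-zà-ÿ]+/)                      // keep alphabetic runs only
    .filter((t) => t && !STOPWORDS.has(t));

  return tokens
    .slice(0, 3) // first 2-3 meaningful tokens
    .map((t) => t[0].toUpperCase() + t.slice(1))
    .join(" ");
}

// extractMerchant("POS 4839 STARBUCKS COFFEE MILANO 02/11 IT")
//   → "Starbucks Coffee Milano" (location tokens can slip through)
```

Note how the city name survives: position-based extraction has no merchant database to tell it "Milano" is a location, which is exactly the trade-off discussed below.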
Category classification uses weighted keyword matching. Each category (Home, Transport, Food, Shopping, Entertainment, Health) has a keyword dictionary and a weight factor. The system calculates a score for each category based on keyword presence, selecting the highest-scoring match. Weights prioritize certain categories (e.g., Home utilities get 1.2x weight vs. Shopping at 0.9x) to reflect domain knowledge about transaction frequency and importance.
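The weighted scoring can be sketched as follows. The keyword lists and weights here are small illustrative subsets of the dictionaries described above:

```javascript
// Sketch of weighted keyword scoring. Keyword lists and weights are
// small illustrative subsets of the real category dictionaries.
const CATEGORIES = {
  Home:     { weight: 1.2, keywords: ["electric", "gas", "rent", "water"] },
  Food:     { weight: 1.0, keywords: ["coffee", "restaurant", "pizza"] },
  Shopping: { weight: 0.9, keywords: ["store", "shop", "mall"] },
};

function categorize(tokens) {
  let best = { name: "Other", score: 0 };
  for (const [name, { weight, keywords }] of Object.entries(CATEGORIES)) {
    const hits = tokens.filter((t) => keywords.includes(t)).length;
    const score = hits * weight; // keyword presence scaled by category weight
    if (score > best.score) best = { name, score };
  }
  return best.name;
}

// categorize(["starbucks", "coffee"]) → "Food"
// categorize(["city", "water", "shop"]) → "Home" (1.2 beats Shopping's 0.9)
```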
Signal detection runs pattern matching for recurring subscriptions (Netflix, Spotify, etc.) and category-specific signals (food delivery, international travel). The system returns confidence scores based on merchant name quality, category match strength, and signal presence.
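The pattern matchers can be sketched like this. The merchant lists and signal names are illustrative assumptions, not the library's actual dictionaries:

```javascript
// Sketch of the pattern matchers. Merchant lists and signal names are
// illustrative assumptions, not the library's actual dictionaries.
const SUBSCRIPTIONS = ["netflix", "spotify", "disney", "hulu"];
const FOOD_DELIVERY = ["deliveroo", "glovo", "ubereats", "justeat"];

function detectSignals(tokens, category) {
  const signals = [];
  if (tokens.some((t) => SUBSCRIPTIONS.includes(t))) {
    signals.push("recurring-subscription");
  }
  if (category === "Food" && tokens.some((t) => FOOD_DELIVERY.includes(t))) {
    signals.push("food-delivery"); // category-specific signal
  }
  return signals;
}

// detectSignals(["netflix"], "Entertainment") → ["recurring-subscription"]
// detectSignals(["glovo"], "Food") → ["food-delivery"]
```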
Challenge: Handling unseen merchants without training data.
Solution: Extract merchant names using position-based heuristics (first N tokens after noise removal) rather than requiring a known merchant database. Trade-off: May misidentify merchants with unusual naming patterns, but generalizes to any transaction description.
Challenge: Multi-language support without language detection.
Solution: Use a unified stopword list covering common EN/IT/ES/FR/DE banking terms. Trade-off: A larger stopword set increases false positives for legitimate merchant names, but eliminates the need for language-detection API calls.
Challenge: Balancing accuracy vs. inference speed.
Solution: Use keyword-based semantic matching instead of embedding similarity or ML models. Trade-off: Lower accuracy on ambiguous transactions (e.g., "Amazon" could be shopping or entertainment), but achieves <5ms inference time, suitable for real-time UI updates.
Challenge: Confidence scoring without ground truth labels.
Solution: Heuristic-based confidence calculation using merchant name length, category match presence, and signal count. Trade-off: Confidence scores are relative indicators rather than calibrated probabilities, but provide useful ranking for downstream filtering.
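The heuristic above might be sketched as follows. The function name, weights, and thresholds are all illustrative assumptions; the real scoring may differ:

```javascript
// Sketch of the confidence heuristic. Function name, weights, and
// thresholds are illustrative assumptions, not the real scoring.
function confidenceScore({ merchant, category, signals }) {
  let score = 0;
  if (merchant && merchant.length >= 4) score += 0.4; // plausible merchant name
  if (category && category !== "Other") score += 0.4; // a category matched
  score += Math.min(signals.length, 2) * 0.1;         // each signal adds a little
  return Math.min(score, 1); // relative indicator, not a calibrated probability
}

// confidenceScore({ merchant: "Starbucks Coffee", category: "Food", signals: [] })
//   → 0.8
```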
JavaScript over Python: Transaction enrichment happens in user-facing web applications. Running inference client-side eliminates server round-trips (200-500ms latency reduction) and reduces hosting costs for high-volume transaction processing.
Rule-based NLP over ML models: No labeled training data available for this specific transaction format. Rule-based approach provides deterministic, explainable results that can be debugged and refined without retraining. Suitable for MVP and production environments where model deployment infrastructure is unavailable.
Keyword matching over embeddings: Transaction descriptions are short (5-15 tokens) with limited semantic variation. Keyword matching provides 80% of the accuracy at 1% of the computational cost. Embedding models (word2vec, transformers) would require 50-100MB models and 100ms+ inference times without proportional accuracy gains.
Client-side execution over server API: Privacy considerations—users may prefer transaction data not leaving their browser. Client-side processing enables offline functionality and reduces GDPR/PSD2 compliance complexity.
Enter raw transaction descriptions to see AI-powered enrichment vs. rule-based extraction.
See how the system handles different transaction formats and edge cases.