LinkRivers is built from the ground up to be a web operations platform that scales from a single portfolio site to thousands of enterprise domains. This document outlines the key architectural decisions, infrastructure choices, and ML systems that make that possible.
Architecture at a Glance
Our architecture follows a microservices pattern deployed across multiple availability zones. The core principles:
- Stateless compute layer - All application servers are stateless and horizontally scalable
- Event-driven data pipeline - Real-time events flow through a message queue for async processing
- Multi-region monitoring - Uptime checks run from 30+ global locations simultaneously
- Edge-first analytics - Agent Lite processes data at the edge before sending to our collectors
- ML inference at scale - Predictions run on dedicated GPU infrastructure with sub-100ms latency
Design philosophy: We optimize for reliability first, then speed, then cost. A monitoring platform that goes down is worse than no monitoring at all.
Infrastructure Stack
| Layer | Technology |
| --- | --- |
| Compute | Railway + Containers |
| Database | PostgreSQL |
| Cache | Redis |
| Queue | BullMQ |
| ML Runtime | Python + GPU inference |
| Frontend | React + Vite |
Why These Choices
PostgreSQL handles our transactional workloads and time-series data. We use table partitioning for metrics data, keeping recent data hot while archiving older data to cold storage. This gives us fast queries on recent data while maintaining a full audit trail.
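As a rough illustration of the partitioning pattern (the table and column names here are hypothetical, not our actual schema), monthly range partitions for a metrics table can be managed from Node like this:

```typescript
// Hypothetical sketch: monthly range partitions for a metrics table,
// managed via node-postgres. Table and column names are illustrative.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Parent table is declared once, partitioned by check timestamp.
const CREATE_PARENT = `
  CREATE TABLE IF NOT EXISTS check_metrics (
    site_id     bigint      NOT NULL,
    checked_at  timestamptz NOT NULL,
    latency_ms  integer,
    status_code integer
  ) PARTITION BY RANGE (checked_at);
`;

// Generate the DDL for one month's partition.
function partitionDdl(year: number, month: number): string {
  const from = new Date(Date.UTC(year, month, 1)).toISOString();
  const to = new Date(Date.UTC(year, month + 1, 1)).toISOString();
  const suffix = `${year}_${String(month + 1).padStart(2, "0")}`;
  return `
    CREATE TABLE IF NOT EXISTS check_metrics_${suffix}
      PARTITION OF check_metrics
      FOR VALUES FROM ('${from}') TO ('${to}');
  `;
}

// A scheduled job keeps the current and next month's partitions ready;
// older partitions get detached and archived to cold storage.
export async function ensurePartitions(now = new Date()): Promise<void> {
  await pool.query(CREATE_PARENT);
  await pool.query(partitionDdl(now.getUTCFullYear(), now.getUTCMonth()));
  await pool.query(partitionDdl(now.getUTCFullYear(), now.getUTCMonth() + 1));
}
```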
Redis serves multiple purposes: session storage, rate limiting, real-time feature flags, and as a caching layer for expensive database queries. Our cache invalidation strategy is event-driven - when underlying data changes, we publish an event that triggers cache eviction.
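A minimal sketch of that invalidation pattern with ioredis, assuming illustrative channel and key names:

```typescript
// Minimal sketch of event-driven cache eviction with ioredis.
// Channel and key names are illustrative.
import Redis from "ioredis";

const cache = new Redis();      // serves reads, writes, and publishes
const subscriber = new Redis(); // pub/sub requires a dedicated connection

// Writers publish an invalidation event whenever underlying data changes.
export async function onSiteUpdated(siteId: string): Promise<void> {
  await cache.publish(
    "cache:invalidate",
    JSON.stringify({ key: `site:${siteId}` }),
  );
}

// Every app instance listens on the channel and evicts affected keys.
void subscriber.subscribe("cache:invalidate");
subscriber.on("message", (_channel, message) => {
  const { key } = JSON.parse(message) as { key: string };
  void cache.del(key);
});
```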
BullMQ powers our job queue system. Every monitoring check, notification dispatch, and ML inference job flows through the queue. This decouples our API layer from compute-intensive work and provides natural backpressure during traffic spikes.
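In BullMQ terms, the pattern looks roughly like this (queue names, payloads, and retry settings are illustrative):

```typescript
// Sketch of the queue pattern with BullMQ; names and settings are illustrative.
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };
const checks = new Queue("uptime-checks", { connection });

// The API layer only enqueues work; it never runs checks inline.
export async function scheduleCheck(siteId: string): Promise<void> {
  await checks.add(
    "http-check",
    { siteId },
    { attempts: 3, backoff: { type: "exponential", delay: 5_000 } },
  );
}

// A separate worker pool consumes jobs; the concurrency cap is what
// provides natural backpressure when traffic spikes.
new Worker(
  "uptime-checks",
  async (job) => {
    // ...run the check, persist results, emit events...
    return { siteId: job.data.siteId, ok: true };
  },
  { connection, concurrency: 25 },
);
```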
Monitoring Pipeline
When you add a site to LinkRivers, here's what happens:
1. Site registration - We validate the domain, check DNS, and create baseline configuration
2. Initial scan - Full technical audit: SSL, headers, performance, SEO, security
3. Schedule creation - Monitoring jobs are distributed across our check nodes
4. Baseline learning - First 72 hours establish normal patterns for anomaly detection
5. Alert configuration - Default thresholds set based on site type and tier
Check Distribution
We run checks from 30+ locations worldwide. Each check is assigned to 3-5 locations simultaneously to screen out false positives caused by regional network issues. A site is only marked "down" when multiple locations confirm the failure.
Minimizing false positives: We use multi-location verification, retry logic, and ML-based confirmation to keep false positive rates low.
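A simplified sketch of the quorum logic (the threshold and location names are illustrative, not our production values):

```typescript
// Sketch of quorum-based down detection; thresholds are illustrative.
interface LocationResult {
  location: string; // e.g. "fra1", "sfo2"
  ok: boolean;
}

// A site is only marked down when a majority of assigned locations
// independently confirm the failure, filtering out regional blips.
function isConfirmedDown(results: LocationResult[], quorum = 0.6): boolean {
  if (results.length < 3) return false; // too few vantage points to decide
  const failures = results.filter((r) => !r.ok).length;
  return failures / results.length >= quorum;
}

// Example: 2 of 4 locations failing reads as a regional issue, not an outage.
isConfirmedDown([
  { location: "fra1", ok: false },
  { location: "sfo2", ok: true },
  { location: "sin1", ok: false },
  { location: "gru1", ok: true },
]); // => false
```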
ML Models
LinkRivers uses machine learning for three primary functions:
1. Anomaly Detection
We train proprietary models on each site's historical data. The model learns normal patterns for response time, availability, and traffic. When metrics deviate significantly, we flag it as an anomaly - often catching issues before they become full outages.
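The production models are proprietary, but a rolling z-score conveys the core idea of learning a baseline and flagging deviations:

```typescript
// Illustrative only, not our production models: a rolling z-score
// captures the essence of "learn normal, flag deviations".
function zScoreAnomaly(history: number[], latest: number, threshold = 3): boolean {
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance =
    history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  const std = Math.sqrt(variance);
  if (std === 0) return latest !== mean; // flat baseline: any change is anomalous
  return Math.abs((latest - mean) / std) > threshold;
}

// A 250ms response on a site that normally answers in ~100ms is flagged.
zScoreAnomaly([98, 102, 100, 99, 101, 103, 97], 250); // => true
```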
2. Predictive Forecasting
After 2-4 weeks of training on your data, time-series forecasting models predict metrics 24-72 hours ahead. This powers features like "SSL expires in 14 days" and "Traffic spike expected Thursday based on historical patterns."
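As a simplified stand-in for those models, a seasonal-naive baseline shows the shape of the problem: repeat the matching hour from one week ago.

```typescript
// Illustrative baseline, not our production models: a seasonal-naive
// forecast repeats the matching hour from one season (here, one week) ago.
function seasonalNaiveForecast(
  hourly: number[],       // historical hourly values, oldest first
  horizonHours: number,   // e.g. 72 for a 3-day forecast
  seasonHours = 24 * 7,   // weekly seasonality
): number[] {
  if (hourly.length < seasonHours) {
    throw new Error("need at least one full season of history");
  }
  const forecast: number[] = [];
  for (let h = 0; h < horizonHours; h++) {
    forecast.push(hourly[hourly.length - seasonHours + (h % seasonHours)]);
  }
  return forecast;
}
```

The one-full-season requirement is why forecasting only activates after a few weeks of data collection.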
3. Root Cause Analysis
When an incident occurs, our classification model analyzes the symptoms and suggests probable causes. This accelerates diagnosis from minutes to seconds for common failure modes.
Model Training Pipeline
1. Data collection - Continuous ingestion of check results, metrics, and user events
2. Feature engineering - Rolling aggregates, time-of-day encoding, trend extraction
3. Training - Scheduled retraining on fresh data, triggered by drift detection
4. Validation - Holdout testing against recent incidents
5. Deployment - Blue-green model deployment with automatic rollback on accuracy drop
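A hedged sketch of that final promotion step, with hypothetical stand-ins for our deployment tooling:

```typescript
// Hedged sketch of blue-green model promotion; switchTraffic and the
// accuracy tolerance are hypothetical stand-ins for our deployment tooling.
interface ModelSlot {
  version: string;
  accuracy: number; // measured on the holdout set of recent incidents
}

// Hypothetical helper: point the inference service at a model version.
async function switchTraffic(version: string): Promise<void> {
  console.log(`routing inference traffic to model ${version}`);
}

// Promote the candidate only if holdout accuracy hasn't dropped beyond
// tolerance; otherwise stay on (roll back to) the live model.
export async function promoteIfHealthy(
  live: ModelSlot,
  candidate: ModelSlot,
  maxAccuracyDrop = 0.01,
): Promise<ModelSlot> {
  if (candidate.accuracy >= live.accuracy - maxAccuracyDrop) {
    await switchTraffic(candidate.version);
    return candidate;
  }
  await switchTraffic(live.version); // automatic rollback path
  return live;
}
```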
Agent Lite
Agent Lite is our client-side JavaScript agent that powers Real User Monitoring (RUM). Design goals:
- Minimal footprint - Core loader under 2KB gzipped
- Zero dependencies - Pure vanilla JavaScript, no frameworks
- Privacy-first - No cookies, no PII collection, IP anonymization
- Modular loading - Only loads modules the site actually uses
- Graceful degradation - Fails silently if blocked or errored
The agent captures Core Web Vitals (LCP, FID, CLS), navigation timing, resource loading, JavaScript errors, and user interactions. All data is batched and sent in compressed payloads to minimize network overhead.
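A stripped-down sketch of that pattern using standard browser APIs (the /collect endpoint and payload shape are illustrative):

```typescript
// Simplified sketch of the RUM capture-and-batch pattern; the endpoint
// and payload shape are illustrative. Runs in the browser.
const buffer: Record<string, number>[] = [];

// Observe Largest Contentful Paint via the standard PerformanceObserver API.
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    buffer.push({ lcp: entry.startTime });
  }
}).observe({ type: "largest-contentful-paint", buffered: true });

// Batch and ship when the page is hidden; sendBeacon survives unload
// and fails silently if blocked, matching the graceful-degradation goal.
addEventListener("visibilitychange", () => {
  if (document.visibilityState === "hidden" && buffer.length > 0) {
    navigator.sendBeacon("/collect", JSON.stringify(buffer.splice(0)));
  }
});
```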
Autopilot System
Autopilot is our automated remediation engine. When an issue is detected:
1. Classification - Determine issue type and severity
2. Action selection - Choose appropriate fix from the playbook
3. Permission check - Verify user has enabled auto-fix for this issue type
4. Execution - Apply the fix via integration APIs
5. Verification - Confirm the fix resolved the issue
6. Logging - Full audit trail of what was changed and why
Human oversight: Every Autopilot action is logged and reversible. Users can configure supervised mode where fixes are proposed but require approval before execution.
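In code, the flow looks roughly like this sketch (issue types and playbook entries are illustrative stubs):

```typescript
// Hedged sketch of the Autopilot flow; issue types, playbook entries,
// and integration calls are illustrative stubs.
type IssueType = "expired-cache" | "ssl-renewal";

interface AutopilotAction {
  apply(): Promise<void>;
  verify(): Promise<boolean>;
}

// One playbook entry per issue type.
const playbook: Record<IssueType, AutopilotAction> = {
  "expired-cache": {
    apply: async () => console.log("purge CDN cache via integration API"),
    verify: async () => true,
  },
  "ssl-renewal": {
    apply: async () => console.log("trigger certificate renewal"),
    verify: async () => true,
  },
};

export async function runAutopilot(
  issue: IssueType,
  autoFixEnabled: boolean,        // per-user, per-issue-type permission
  audit: (msg: string) => void,   // append-only audit log
): Promise<void> {
  if (!autoFixEnabled) {
    // Supervised mode: propose the fix and wait for approval.
    audit(`proposed fix for ${issue}; awaiting user approval`);
    return;
  }
  const action = playbook[issue];
  await action.apply();
  const fixed = await action.verify();
  audit(`applied fix for ${issue}; verified=${fixed}`);
}
```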
Security Model
Security is foundational, not an afterthought:
- Encryption everywhere - TLS 1.3 in transit, AES-256 at rest
- Zero trust architecture - Every request is authenticated and authorized
- Minimal data retention - We only store what's necessary for the service
- Regular audits - Penetration testing and security reviews
- GDPR compliant - data handling meets EU data protection and privacy requirements
API Security
API keys are hashed before storage. Rate limiting is enforced at multiple layers: per-key, per-IP, and per-endpoint. All API calls are logged for audit purposes with automatic anomaly detection on access patterns.
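A sketch of both patterns (the key prefix and limits are illustrative):

```typescript
// Sketch of key hashing and fixed-window rate limiting; the key prefix
// and limits are illustrative.
import { createHash } from "node:crypto";
import Redis from "ioredis";

const redis = new Redis();

// Keys are hashed before storage, so a database leak never exposes raw keys.
function hashApiKey(rawKey: string): string {
  return createHash("sha256").update(rawKey).digest("hex");
}

// Fixed-window per-key limit: INCR + EXPIRE on a window-scoped counter.
export async function allowRequest(rawKey: string, limit = 100): Promise<boolean> {
  const window = Math.floor(Date.now() / 60_000); // one-minute windows
  const counterKey = `rl:${hashApiKey(rawKey)}:${window}`;
  const count = await redis.incr(counterKey);
  if (count === 1) await redis.expire(counterKey, 60); // first hit sets the TTL
  return count <= limit;
}
```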
Scalability
The architecture is designed to scale horizontally from day one. Key design principles:
- Stateless compute - Adding capacity is a matter of spinning up more containers
- Partitioned data - Time-series data is partitioned for efficient queries at scale
- Distributed checks - Monitoring load is distributed across global check nodes
- Queue-based processing - Async jobs provide natural backpressure during traffic spikes
This architecture allows us to grow with our customers without requiring fundamental changes.
What's Next
We're actively working on:
- Enhanced ML models - More accurate predictions with site-specific training
- Expanded integrations - Direct connections to more hosting and CDN providers
- Edge functions - Run custom checks at the edge for sub-second detection
- API v2 - GraphQL API for more flexible data access
Questions about our architecture? Get in touch - we love talking about this stuff.