Baseline Monitoring and Alerts
This guide covers the intelligent baseline monitoring features, including statistical baseline calculation, anomaly detection, and automated alerting.
⚙️ Optional Feature - Baseline monitoring is disabled by default. Enable it in your initializer:
RailsErrorDashboard.configure do |config|
config.enable_baseline_alerts = true
config.baseline_alert_threshold_std_devs = 2.0 # Alert when >2 std devs above baseline
config.baseline_alert_severities = [:critical, :high] # Alert on these severities
config.baseline_alert_cooldown_minutes = 120 # 2 hours between alerts
end
Table of Contents
- Overview
- Baseline Calculation
- Anomaly Detection
- Automated Alerts
- Configuration
- Best Practices
- Troubleshooting
Overview
Baseline monitoring goes beyond simple spike detection by using statistical methods to:
- Calculate intelligent baselines - Not just averages, but statistically sound thresholds
- Detect anomalies - Identify when error rates exceed expected ranges
- Send proactive alerts - Notify teams before issues escalate
- Track trends - Monitor how baselines evolve over time
Why Baselines Matter
Simple thresholds (e.g., “alert if >100 errors/hour”) don’t work because:
- Normal error rates vary by time of day, day of week, and season
- What’s normal for one error type may be abnormal for another
- Static thresholds cause false positives (alert fatigue) or miss real issues
Baselines solve this by establishing dynamic, context-aware thresholds.
Baseline Calculation
What is a Baseline?
A baseline is the expected normal range for an error type, calculated from historical data using statistical methods.
For each error type and platform combination, we track:
- Mean: Average error count
- Standard Deviation: How much variation is normal
- Percentiles: 95th and 99th percentile values
- Sample Size: How many data points were used
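Concretely, a stored hourly baseline slice might look like this (illustrative values only, not gem output):

```ruby
# Illustrative example of the statistics tracked for one
# (error_type, platform, hour-of-day) slice; the numbers are made up
example_baseline = {
  error_type: "NoMethodError",
  platform: "iOS",
  mean: 15.2,          # average error count for this slice
  std_dev: 4.8,        # how much variation is normal
  percentile_95: 23,   # 95% of observed periods were at or below this count
  percentile_99: 31,
  sample_size: 28      # number of data points behind the statistics
}
```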
Baseline Types
We calculate three types of baselines:
1. Hourly Baseline
- Lookback: Last 4 weeks
- Granularity: By hour of day (0-23)
- Use Case: Detect unusual spikes during specific hours
- Example: “Between 2-3 PM, we normally see 10-20 errors, but today we saw 150”
2. Daily Baseline
- Lookback: Last 12 weeks
- Granularity: By day of week (Mon-Sun)
- Use Case: Detect unusual daily patterns
- Example: “Mondays usually have 500 errors, but this Monday had 2000”
3. Weekly Baseline
- Lookback: Last 52 weeks (1 year)
- Granularity: By week number
- Use Case: Detect long-term trends and seasonal changes
- Example: “This week’s error rate is 3x higher than the same week last year”
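Taken together, the three baseline types differ only in their lookback window and grouping key. A rough Ruby illustration (the constant and lambdas below are ours, not part of the gem's API):

```ruby
# Illustrative mapping only; duration helpers require ActiveSupport
BASELINE_TYPES = {
  hourly: { lookback: 4.weeks,  group_by: ->(t) { t.hour } },               # 0-23
  daily:  { lookback: 12.weeks, group_by: ->(t) { t.wday } },               # day of week
  weekly: { lookback: 52.weeks, group_by: ->(t) { t.strftime("%V").to_i } } # ISO week number
}.freeze
```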
Statistical Method
We use a robust statistical approach that handles outliers:
# Simplified algorithm (plain Ruby)
def calculate_baseline(error_counts)
  # 1. Remove extreme outliers (> 3 standard deviations above the mean)
  mean = error_counts.sum.to_f / error_counts.size
  std_dev = Math.sqrt(error_counts.sum { |c| (c - mean)**2 } / error_counts.size)
  filtered = error_counts.reject { |count| count > mean + (3 * std_dev) }

  # 2. Recalculate statistics on the filtered data
  baseline_mean = filtered.sum.to_f / filtered.size
  baseline_std_dev = Math.sqrt(filtered.sum { |c| (c - baseline_mean)**2 } / filtered.size)

  # 3. Calculate percentiles from the sorted, filtered counts
  sorted = filtered.sort
  percentile_95 = sorted[(0.95 * (sorted.size - 1)).round]
  percentile_99 = sorted[(0.99 * (sorted.size - 1)).round]

  {
    mean: baseline_mean,
    std_dev: baseline_std_dev,
    percentile_95: percentile_95,
    percentile_99: percentile_99,
    sample_size: filtered.count
  }
end
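For example, running the simplified function above on a slice of hourly counts (sample numbers chosen to show the outlier filter at work):

```ruby
counts = [12, 15, 18, 14, 16, 13, 17, 15, 14, 16, 15, 120] # one extreme spike
baseline = calculate_baseline(counts)

baseline[:sample_size] # => 11 (the 120 spike exceeds mean + 3 std devs and is dropped)
baseline[:mean]        # => 15.0, the "normal" rate the spike no longer distorts
```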
How Baselines are Calculated
The BaselineCalculator service runs daily via a background job:
# Triggered automatically
RailsErrorDashboard::BaselineCalculationJob.perform_later
# Or manually via console
RailsErrorDashboard::Services::BaselineCalculator.calculate_all_baselines
Process:
- For each unique (error_type, platform) pair:
  - Fetch error counts for the lookback period
  - Group by time unit (hour/day/week)
  - Calculate statistics (mean, std_dev, percentiles)
  - Store the results in the `error_baselines` table
  - Update existing baselines or create new ones
Performance: Full recalculation takes ~5-10 minutes for 10,000 errors.
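If you prefer to control the schedule yourself rather than rely on the automatic trigger, a minimal sketch using the sidekiq-cron gem (an assumption; any recurring-job mechanism works) could look like this:

```ruby
# config/initializers/sidekiq.rb (sketch; assumes the sidekiq-cron gem)
Sidekiq::Cron::Job.create(
  name:  "Recalculate error baselines (daily)",
  cron:  "0 3 * * *", # every day at 03:00, outside peak traffic
  class: "RailsErrorDashboard::BaselineCalculationJob"
)
```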
Anomaly Detection
What is an Anomaly?
An anomaly occurs when the current error count significantly exceeds the baseline.
We use a standard deviation-based approach:
current_count > baseline_mean + (threshold * std_dev)
Severity Levels
Anomalies are classified by how far above the baseline they are:
| Severity | Threshold | Description | Action |
|---|---|---|---|
| Normal | < 2 std devs | Within expected range | No alert |
| Elevated | 2-3 std devs | Moderately above normal | Monitor |
| High | 3-4 std devs | Significantly above normal | Investigate |
| Critical | ≥ 4 std devs | Extremely abnormal | Immediate action |
Example Calculation
Baseline for "NoMethodError" on iOS, 2-3 PM:
- Mean: 15 errors
- Std Dev: 5 errors
- Percentile 95: 23 errors
Current count: 35 errors
Calculation:
35 - 15 = 20 errors above mean
20 / 5 = 4 standard deviations
Severity: Critical (≥ 4 std devs)
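A minimal Ruby sketch of this classification (the exact boundary handling is an assumption based on the table above):

```ruby
# Classify how far the current count sits above its baseline
def classify_anomaly(current_count, baseline_mean, baseline_std_dev)
  std_devs_above = (current_count - baseline_mean) / baseline_std_dev.to_f

  severity =
    if    std_devs_above >= 4.0 then :critical
    elsif std_devs_above >= 3.0 then :high
    elsif std_devs_above >= 2.0 then :elevated
    else                             :normal
    end

  { severity: severity, std_devs_above: std_devs_above.round(1) }
end

classify_anomaly(35, 15, 5)
# => { severity: :critical, std_devs_above: 4.0 }
```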
Accessing Anomaly Data
# Get current anomalies
stats = RailsErrorDashboard::Queries::BaselineStats.new
anomalies = stats.current_anomalies(severity: [:high, :critical])
anomalies.each do |anomaly|
puts "#{anomaly[:error_type]} on #{anomaly[:platform]}"
puts " Current: #{anomaly[:current_count]}"
puts " Baseline: #{anomaly[:baseline_mean]} ± #{anomaly[:baseline_std_dev]}"
puts " Severity: #{anomaly[:severity]} (#{anomaly[:std_devs_above]} std devs)"
end
Dashboard Integration
Anomalies are automatically displayed:
- Dashboard: Anomaly alerts card shows active anomalies
- Error Show Page: Baseline comparison chart
- Analytics: Trend charts with baseline ranges
Automated Alerts
Overview
Baseline alerts proactively notify your team when errors exceed baselines, before they become critical issues.
Alert Configuration
Enable baseline alerts in your initializer:
# config/initializers/rails_error_dashboard.rb
RailsErrorDashboard.configure do |config|
# Enable baseline alerting
config.enable_baseline_alerts = true
# Alert threshold (standard deviations above mean)
config.baseline_alert_threshold_std_devs = 2.0 # Default: 2.0
# Which severities to alert on
config.baseline_alert_severities = [:critical, :high] # Default: [:critical, :high]
# Cooldown period between alerts for same error type (minutes)
config.baseline_alert_cooldown_minutes = 120 # Default: 120 (2 hours)
# Alert channels (same as error notifications)
config.enable_slack_notifications = true
config.slack_webhook_url = ENV['SLACK_WEBHOOK_URL']
config.enable_email_notifications = true
config.notification_email = "errors@example.com"
end
Alert Triggers
Alerts are sent when:
- ✅ Baseline alerting is enabled
- ✅ Error count exceeds threshold (e.g., mean + 2 std devs)
- ✅ Severity matches configured severities
- ✅ Cooldown period has elapsed since last alert
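Conceptually, these four conditions reduce to a single predicate. The sketch below is illustrative only and stands in for, rather than reproduces, the gem's internal check:

```ruby
# `config` is the gem configuration; `baseline` is a stored baseline record
def baseline_alert?(config, baseline, current_count, severity)
  return false unless config.enable_baseline_alerts

  threshold = baseline.mean + (config.baseline_alert_threshold_std_devs * baseline.std_dev)

  current_count > threshold &&
    config.baseline_alert_severities.include?(severity) &&
    RailsErrorDashboard::Services::BaselineAlertThrottler.should_alert?(
      error_type: baseline.error_type,
      platform: baseline.platform
    )
end
```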
Alert Payload
{
"alert_type": "baseline_violation",
"error_type": "NoMethodError",
"platform": "iOS",
"current_count": 35,
"baseline_mean": 15,
"baseline_std_dev": 5,
"std_devs_above": 4.0,
"severity": "critical",
"time_period": "2-3 PM",
"baseline_type": "hourly",
"trend": "increasing",
"dashboard_url": "https://yourapp.com/errors/123"
}
Alert Channels
Baseline alerts use the same notification channels as error notifications:
- Slack: Rich message with charts and links
- Email: HTML email with details and dashboard link
- Discord: Webhook notification
- PagerDuty: Incident creation for critical alerts
- Custom Webhook: POST JSON to your endpoint
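If you point the custom webhook at your own application, a minimal receiving endpoint might look like the sketch below (the route, controller name, and downstream handling are yours to define):

```ruby
# app/controllers/baseline_alerts_controller.rb (hypothetical webhook receiver)
class BaselineAlertsController < ApplicationController
  skip_before_action :verify_authenticity_token, raise: false # webhook POSTs carry no CSRF token

  def create
    payload = JSON.parse(request.body.read)

    if payload["alert_type"] == "baseline_violation" && payload["severity"] == "critical"
      # Hand off to whatever escalation path your team uses
      Rails.logger.warn(
        "[baseline] #{payload['error_type']} on #{payload['platform']}: " \
        "#{payload['current_count']} vs baseline #{payload['baseline_mean']}"
      )
    end

    head :ok
  end
end
```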
Cooldown Mechanism
To prevent alert fatigue, alerts are throttled:
# Check if alert should be sent
RailsErrorDashboard::Services::BaselineAlertThrottler.should_alert?(
error_type: "NoMethodError",
platform: "iOS"
)
# => false if alert sent within cooldown period
Implementation:
- Last alert time stored in Redis (if available) or database
- Key: `baseline_alert:#{error_type}:#{platform}`
- Expires after the cooldown period
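When Redis is available, the cooldown check can be a single atomic SET. A sketch under that assumption (not the gem's exact implementation):

```ruby
# Assumes a configured redis-rb client available as `redis` and the default 120-minute cooldown
COOLDOWN_SECONDS = 120 * 60

def should_alert?(error_type:, platform:)
  key = "baseline_alert:#{error_type}:#{platform}"

  # SET with NX + EX succeeds only when the key does not exist yet, so it checks
  # the cooldown and starts a new one in a single atomic operation.
  redis.set(key, Time.now.to_i, nx: true, ex: COOLDOWN_SECONDS)
end
```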
Manual Alert Testing
Test your alert configuration:
# Send a test baseline alert
RailsErrorDashboard::BaselineAlertJob.perform_now(
error_type: "TestError",
platform: "iOS",
current_count: 100,
baseline_mean: 20,
baseline_std_dev: 10,
severity: :critical
)
Configuration
Full Configuration Reference
RailsErrorDashboard.configure do |config|
# === Baseline Calculation ===
# How far back to look for baseline calculation
config.baseline_lookback_weeks = 4 # Hourly baselines
config.baseline_lookback_weeks_daily = 12 # Daily baselines
config.baseline_lookback_weeks_weekly = 52 # Weekly baselines
# Minimum sample size for valid baseline
config.baseline_min_sample_size = 10
# === Anomaly Detection ===
# Outlier removal threshold (std devs)
config.baseline_outlier_threshold = 3.0
# Anomaly severity thresholds (std devs above mean)
config.baseline_elevated_threshold = 2.0
config.baseline_high_threshold = 3.0
config.baseline_critical_threshold = 4.0
# === Alerts ===
# Enable/disable baseline alerting
config.enable_baseline_alerts = true
# Alert threshold (std devs)
config.baseline_alert_threshold_std_devs = 2.0
# Alert severities
config.baseline_alert_severities = [:critical, :high]
# Cooldown period (minutes)
config.baseline_alert_cooldown_minutes = 120
# Alert channels (see notification configuration)
config.enable_slack_notifications = true
config.slack_webhook_url = ENV['SLACK_WEBHOOK_URL']
end
Tuning Recommendations
For Low-Traffic Applications
config.baseline_alert_threshold_std_devs = 3.0 # More lenient
config.baseline_alert_cooldown_minutes = 60 # Shorter cooldown
config.baseline_min_sample_size = 5 # Lower minimum
For High-Traffic Applications
config.baseline_alert_threshold_std_devs = 2.0 # Stricter
config.baseline_alert_cooldown_minutes = 180 # Longer cooldown
config.baseline_min_sample_size = 20 # Higher minimum
For Noisy Error Types
# Option 1: Exclude from alerting
config.baseline_alert_severities = [:critical] # Only critical
# Option 2: Use higher threshold for specific types
# (Custom logic in BaselineAlertJob)
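Option 2 is not a built-in setting; a hypothetical sketch of such custom logic (the constant, error type names, and values are examples only):

```ruby
# Hypothetical per-error-type overrides checked before the default threshold applies
NOISY_TYPE_THRESHOLDS = {
  "Net::ReadTimeout"         => 4.0, # flaky upstream: only alert on large spikes
  "ActiveRecord::Deadlocked" => 3.5
}.freeze

def effective_threshold(error_type, default_threshold)
  NOISY_TYPE_THRESHOLDS.fetch(error_type, default_threshold)
end

effective_threshold("Net::ReadTimeout", 2.0) # => 4.0
effective_threshold("NoMethodError", 2.0)    # => 2.0
```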
Best Practices
1. Start with Conservative Settings
Begin with stricter thresholds and relax them as needed:
config.baseline_alert_threshold_std_devs = 3.0 # Start strict
config.baseline_alert_severities = [:critical] # Only critical
config.baseline_alert_cooldown_minutes = 180 # Longer cooldown
Gradually tune based on your team’s needs.
2. Monitor Baseline Health
Check baselines weekly:
stats = RailsErrorDashboard::Queries::BaselineStats.new
# Check which error types have baselines
baseline_coverage = stats.baseline_coverage
# => { "NoMethodError" => 80%, "ArgumentError" => 60% }
# Identify stale baselines (not updated recently)
stale_baselines = ErrorBaseline.where("updated_at < ?", 7.days.ago)
3. Handle Seasonal Changes
Baselines adapt over time, but sudden changes (e.g., holiday traffic) may cause false alerts:
Solution: Temporarily adjust thresholds:
# During known high-traffic events
config.baseline_alert_threshold_std_devs = 4.0
Or disable alerts for specific periods:
# In config/initializers/rails_error_dashboard.rb
config.enable_baseline_alerts = ENV['BASELINE_ALERTS_ENABLED'] != 'false'
# Then in production:
export BASELINE_ALERTS_ENABLED=false # Temporarily disable
4. Combine with Other Monitoring
Baseline alerts complement (not replace) other monitoring:
- Application Performance Monitoring (APM): Datadog, New Relic, etc.
- Uptime Monitoring: Pingdom, StatusCake, etc.
- Business Metrics: Revenue, conversions, etc.
Use baseline alerts as early warning signals before issues impact users.
5. Review Alert History
Periodically review alerts to tune configuration:
# Get alert history from your logging system or wherever you record sent alerts
# (AlertLogEntry below is a placeholder for your own model)
recent_alerts = AlertLogEntry.where("created_at > ?", 30.days.ago)
# Analyze:
# - Are there frequent false positives? (increase threshold)
# - Are we missing real issues? (decrease threshold)
# - Is cooldown too short? (team fatigued by repeated alerts)
6. Use Baselines for Capacity Planning
Baselines reveal traffic patterns:
stats = RailsErrorDashboard::Queries::BaselineStats.new
hourly_baselines = stats.hourly_baseline("NoMethodError", "iOS")
# Find peak hours (hours whose baseline mean exceeds the average across all hours)
overall_mean = hourly_baselines.sum { |_hour, data| data[:mean] } / hourly_baselines.size.to_f
peak_hours = hourly_baselines
  .select { |hour, data| data[:mean] > overall_mean }
  .keys
# => [9, 10, 11, 14, 15, 16] # Business hours
Use this to:
- Scale infrastructure during peak hours
- Schedule deployments during low-traffic periods
- Plan maintenance windows
Troubleshooting
“No baseline available for error type”
Cause: Not enough historical data to calculate baseline.
Requirements:
- At least `baseline_min_sample_size` data points (default: 10)
- Data must span the lookback period (e.g., 4 weeks for hourly)
Solution:
# Check sample size
ErrorLog.where(error_type: "YourError").count
# If < 10, wait for more data
# Lower minimum if needed (not recommended)
config.baseline_min_sample_size = 5
“Baseline alerts not sending”
Checklist:
- ✅ Alerts enabled: `config.enable_baseline_alerts = true`
- ✅ Notification channel configured: e.g. `config.enable_slack_notifications = true`
- ✅ Severity matches: `config.baseline_alert_severities` includes the current severity
- ✅ Cooldown expired: check the last alert time
- ✅ Threshold exceeded: Current count > mean + (threshold * std_dev)
Debug:
# Check baseline exists
baseline = ErrorBaseline.find_by(error_type: "YourError", platform: "iOS")
baseline.present? # Should be true
# Check cooldown
throttler = RailsErrorDashboard::Services::BaselineAlertThrottler
throttler.should_alert?(error_type: "YourError", platform: "iOS")
# => true if alert should be sent
# Manually trigger alert
RailsErrorDashboard::BaselineAlertJob.perform_now(...)
“Too many false positive alerts”
Causes:
- Threshold too sensitive
- High natural variance in error rates
- Insufficient historical data
Solutions:
- Increase the threshold: `config.baseline_alert_threshold_std_devs = 3.0` (was 2.0)
- Alert only on critical: `config.baseline_alert_severities = [:critical]` (was `[:critical, :high]`)
- Increase the cooldown: `config.baseline_alert_cooldown_minutes = 240` (was 120; 4 hours)
- Exclude noisy error types (custom modification in BaselineAlertJob): `return if error_type.in?(["CommonWarning", "ExpectedError"])`
“Alerts missing real issues”
Causes:
- Threshold too lenient
- Gradual increases not detected (boiling frog problem)
- Baseline not updated recently
Solutions:
- Decrease the threshold: `config.baseline_alert_threshold_std_devs = 1.5` (was 2.0)
- Alert on more severities: `config.baseline_alert_severities = [:critical, :high, :elevated]`
- Force a baseline recalculation: `RailsErrorDashboard::Services::BaselineCalculator.calculate_all_baselines(force: true)`
- Monitor trends manually: `stats = RailsErrorDashboard::Queries::BaselineStats.new`, then `trends = stats.error_trends(days: 30)` and look for gradual increases
“Baselines seem inaccurate”
Causes:
- Recent code changes altered normal behavior
- Seasonal patterns not yet learned
- Outliers not properly filtered
Solutions:
- Reset baselines after major changes:

```ruby
# Delete old baselines for affected error types
ErrorBaseline.where(error_type: "AffectedError").delete_all

# Recalculation will use only recent data
RailsErrorDashboard::Services::BaselineCalculator.calculate_all_baselines
```

- Adjust the outlier threshold:

```ruby
config.baseline_outlier_threshold = 2.5 # Was 3.0 (more aggressive filtering)
```

- Use a shorter lookback for new features:

```ruby
# Temporarily use a shorter period
config.baseline_lookback_weeks = 2 # Was 4
```
Database Schema
ErrorBaseline Table
create_table :rails_error_dashboard_error_baselines do |t|
t.string :error_type, null: false
t.string :platform
t.string :baseline_type, null: false # "hourly", "daily", "weekly"
t.datetime :period_start, null: false
t.datetime :period_end, null: false
# Statistical measures
t.float :mean
t.float :std_dev
t.float :percentile_95
t.float :percentile_99
t.integer :sample_size
t.timestamps
end
add_index :rails_error_dashboard_error_baselines,
[:error_type, :platform, :baseline_type, :period_start],
name: "index_error_baselines_on_type_platform_baseline_period"
API Reference
BaselineStats Query Object
stats = RailsErrorDashboard::Queries::BaselineStats.new
# Get baseline for specific error type
baseline = stats.hourly_baseline("NoMethodError", "iOS")
# => { mean: 15, std_dev: 5, percentile_95: 23, ... }
# Get current anomalies
anomalies = stats.current_anomalies(severity: [:high, :critical])
# => [{ error_type: "...", severity: :critical, std_devs_above: 4.2 }]
# Check if current count is anomalous
is_anomaly = stats.is_anomaly?(
error_type: "NoMethodError",
platform: "iOS",
current_count: 35
)
# => { anomaly: true, severity: :critical, std_devs_above: 4.0 }
BaselineCalculator Service
# Calculate all baselines
RailsErrorDashboard::Services::BaselineCalculator.calculate_all_baselines
# Calculate for specific error type
RailsErrorDashboard::Services::BaselineCalculator.calculate_baseline(
error_type: "NoMethodError",
platform: "iOS",
baseline_type: "hourly"
)
BaselineAlertThrottler Service
throttler = RailsErrorDashboard::Services::BaselineAlertThrottler
# Check if alert should be sent
should_send = throttler.should_alert?(
error_type: "NoMethodError",
platform: "iOS"
)
# => true or false
# Record that alert was sent
throttler.record_alert(
error_type: "NoMethodError",
platform: "iOS"
)
Further Reading
- Advanced Error Grouping Guide - Fuzzy matching and cascades
- Platform Comparison Guide - iOS vs Android analysis
- Occurrence Patterns Guide - Cyclical and burst patterns
- Error Correlation Guide - Release and user correlation