Baseline Monitoring and Alerts
This guide covers the intelligent baseline monitoring features, including statistical baseline calculation, anomaly detection, and automated alerting.
⚙️ Optional Feature - Baseline monitoring is disabled by default. Enable it in your initializer:
RailsErrorDashboard.configure do |config|
config.enable_baseline_alerts = true
config.baseline_alert_threshold_std_devs = 2.0 # Alert when >2 std devs above baseline
config.baseline_alert_severities = [:critical, :high] # Alert on these severities
config.baseline_alert_cooldown_minutes = 120 # 2 hours between alerts
end
Table of Contents
- Overview
- Baseline Calculation
- Anomaly Detection
- Automated Alerts
- Configuration
- Best Practices
- Troubleshooting
Overview
Baseline monitoring goes beyond simple spike detection by using statistical methods to:
- Calculate intelligent baselines - Not just averages, but statistically sound thresholds
- Detect anomalies - Identify when error rates exceed expected ranges
- Send proactive alerts - Notify teams before issues escalate
- Track trends - Monitor how baselines evolve over time
Why Baselines Matter
Simple thresholds (e.g., “alert if >100 errors/hour”) don’t work because:
- Normal error rates vary by time of day, day of week, and season
- What’s normal for one error type may be abnormal for another
- Static thresholds cause false positives (alert fatigue) or miss real issues
Baselines solve this by establishing dynamic, context-aware thresholds.
Baseline Calculation
What is a Baseline?
A baseline is the expected normal range for an error type, calculated from historical data using statistical methods.
For each error type and platform combination, we track:
- Mean: Average error count
- Standard Deviation: How much variation is normal
- Percentiles: 95th and 99th percentile values
- Sample Size: How many data points were used
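Concretely, a stored hourly baseline slice might look like this (illustrative values only, not gem output):

```ruby
# Illustrative example of the statistics tracked for one
# (error_type, platform, hour-of-day) slice; the numbers are made up
example_baseline = {
  error_type: "NoMethodError",
  platform: "iOS",
  mean: 15.2,          # average error count for this slice
  std_dev: 4.8,        # how much variation is normal
  percentile_95: 23,   # 95% of observed periods were at or below this count
  percentile_99: 31,
  sample_size: 28      # number of data points behind the statistics
}
```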
Baseline Types
We calculate three types of baselines:
1. Hourly Baseline
- Lookback: Last 4 weeks
- Granularity: By hour of day (0-23)
- Use Case: Detect unusual spikes during specific hours
- Example: “Between 2-3 PM, we normally see 10-20 errors, but today we saw 150”
2. Daily Baseline
- Lookback: Last 12 weeks
- Granularity: By day of week (Mon-Sun)
- Use Case: Detect unusual daily patterns
- Example: “Mondays usually have 500 errors, but this Monday had 2000”
3. Weekly Baseline
- Lookback: Last 52 weeks (1 year)
- Granularity: By week number
- Use Case: Detect long-term trends and seasonal changes
- Example: “This week’s error rate is 3x higher than the same week last year”
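Taken together, the three baseline types differ only in their lookback window and grouping key. A rough Ruby illustration (the constant and lambdas below are ours, not part of the gem's API):

```ruby
# Illustrative mapping only; duration helpers require ActiveSupport
BASELINE_TYPES = {
  hourly: { lookback: 4.weeks,  group_by: ->(t) { t.hour } },               # 0-23
  daily:  { lookback: 12.weeks, group_by: ->(t) { t.wday } },               # day of week
  weekly: { lookback: 52.weeks, group_by: ->(t) { t.strftime("%V").to_i } } # ISO week number
}.freeze
```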
Statistical Method
We use a robust statistical approach that handles outliers:
# Simplified algorithm (plain Ruby)
def calculate_baseline(error_counts)
  # 1. Remove extreme outliers (> 3 standard deviations above the mean)
  mean = error_counts.sum.to_f / error_counts.size
  std_dev = Math.sqrt(error_counts.sum { |c| (c - mean)**2 } / error_counts.size)
  filtered = error_counts.reject { |count| count > mean + (3 * std_dev) }

  # 2. Recalculate statistics on the filtered data
  baseline_mean = filtered.sum.to_f / filtered.size
  baseline_std_dev = Math.sqrt(filtered.sum { |c| (c - baseline_mean)**2 } / filtered.size)

  # 3. Calculate percentiles from the sorted, filtered counts
  sorted = filtered.sort
  percentile_95 = sorted[(0.95 * (sorted.size - 1)).round]
  percentile_99 = sorted[(0.99 * (sorted.size - 1)).round]

  {
    mean: baseline_mean,
    std_dev: baseline_std_dev,
    percentile_95: percentile_95,
    percentile_99: percentile_99,
    sample_size: filtered.count
  }
end
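For example, running the simplified function above on a slice of hourly counts (sample numbers chosen to show the outlier filter at work):

```ruby
counts = [12, 15, 18, 14, 16, 13, 17, 15, 14, 16, 15, 120] # one extreme spike
baseline = calculate_baseline(counts)

baseline[:sample_size] # => 11 (the 120 spike exceeds mean + 3 std devs and is dropped)
baseline[:mean]        # => 15.0, the "normal" rate the spike no longer distorts
```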
How Baselines are Calculated
The BaselineCalculator service runs daily via a background job:
# Triggered automatically
RailsErrorDashboard::BaselineCalculationJob.perform_later
# Or manually via console
RailsErrorDashboard::Services::BaselineCalculator.calculate_all_baselines
Process:
- For each unique (error_type, platform) pair:
  - Fetch error counts for the lookback period
  - Group by time unit (hour/day/week)
  - Calculate statistics (mean, std_dev, percentiles)
  - Store the results in the `error_baselines` table
  - Update existing baselines or create new ones
Performance: Full recalculation takes ~5-10 minutes for 10,000 errors.
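If you prefer to control the schedule yourself rather than rely on the automatic trigger, a minimal sketch using the sidekiq-cron gem (an assumption; any recurring-job mechanism works) could look like this:

```ruby
# config/initializers/sidekiq.rb (sketch; assumes the sidekiq-cron gem)
Sidekiq::Cron::Job.create(
  name:  "Recalculate error baselines (daily)",
  cron:  "0 3 * * *", # every day at 03:00, outside peak traffic
  class: "RailsErrorDashboard::BaselineCalculationJob"
)
```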
Anomaly Detection
What is an Anomaly?
An anomaly occurs when the current error count significantly exceeds the baseline.
We use a standard deviation-based approach:
current_count > baseline_mean + (threshold * std_dev)
Severity Levels
Anomalies are classified by how far above the baseline they are:
| Severity | Threshold | Description | Action |
|---|---|---|---|
| Normal | < 2 std devs | Within expected range | No alert |
| Elevated | 2-3 std devs | Moderately above normal | Monitor |
| High | 3-4 std devs | Significantly above normal | Investigate |
| Critical | ≥ 4 std devs | Extremely abnormal | Immediate action |
Example Calculation
Baseline for "NoMethodError" on iOS, 2-3 PM:
- Mean: 15 errors
- Std Dev: 5 errors
- Percentile 95: 23 errors
Current count: 35 errors
Calculation:
35 - 15 = 20 errors above mean
20 / 5 = 4 standard deviations
Severity: Critical (≥ 4 std devs)
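A minimal Ruby sketch of this classification (the exact boundary handling is an assumption based on the table above):

```ruby
# Classify how far the current count sits above its baseline
def classify_anomaly(current_count, baseline_mean, baseline_std_dev)
  std_devs_above = (current_count - baseline_mean) / baseline_std_dev.to_f

  severity =
    if    std_devs_above >= 4.0 then :critical
    elsif std_devs_above >= 3.0 then :high
    elsif std_devs_above >= 2.0 then :elevated
    else                             :normal
    end

  { severity: severity, std_devs_above: std_devs_above.round(1) }
end

classify_anomaly(35, 15, 5)
# => { severity: :critical, std_devs_above: 4.0 }
```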
Accessing Anomaly Data
# Get current anomalies
stats = RailsErrorDashboard::Queries::BaselineStats.new
anomalies = stats.current_anomalies(severity: [:high, :critical])
anomalies.each do |anomaly|
puts "#{anomaly[:error_type]} on #{anomaly[:platform]}"
puts " Current: #{anomaly[:current_count]}"
puts " Baseline: #{anomaly[:baseline_mean]} ± #{anomaly[:baseline_std_dev]}"
puts " Severity: #{anomaly[:severity]} (#{anomaly[:std_devs_above]} std devs)"
end
Dashboard Integration
Anomalies are automatically displayed:
- Dashboard: Anomaly alerts card shows active anomalies
- Error Show Page: Baseline comparison chart
- Analytics: Trend charts with baseline ranges
Automated Alerts
Overview
Baseline alerts proactively notify your team when errors exceed baselines, before they become critical issues.
Alert Configuration
Enable baseline alerts in your initializer:
# config/initializers/rails_error_dashboard.rb
RailsErrorDashboard.configure do |config|
# Enable baseline alerting
config.enable_baseline_alerts = true
# Alert threshold (standard deviations above mean)
config.baseline_alert_threshold_std_devs = 2.0 # Default: 2.0
# Which severities to alert on
config.baseline_alert_severities = [:critical, :high] # Default: [:critical, :high]
# Cooldown period between alerts for same error type (minutes)
config.baseline_alert_cooldown_minutes = 120 # Default: 120 (2 hours)
# Alert channels (same as error notifications)
config.enable_slack_notifications = true
config.slack_webhook_url = ENV['SLACK_WEBHOOK_URL']
config.enable_email_notifications = true
config.notification_email = "errors@example.com"
end
Alert Triggers
Alerts are sent when:
- ✅ Baseline alerting is enabled
- ✅ Error count exceeds threshold (e.g., mean + 2 std devs)
- ✅ Severity matches configured severities
- ✅ Cooldown period has elapsed since last alert
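Conceptually, these four conditions reduce to a single predicate. The sketch below is illustrative only and stands in for, rather than reproduces, the gem's internal check:

```ruby
# `config` is the gem configuration; `baseline` is a stored baseline record
def baseline_alert?(config, baseline, current_count, severity)
  return false unless config.enable_baseline_alerts

  threshold = baseline.mean + (config.baseline_alert_threshold_std_devs * baseline.std_dev)

  current_count > threshold &&
    config.baseline_alert_severities.include?(severity) &&
    RailsErrorDashboard::Services::BaselineAlertThrottler.should_alert?(
      error_type: baseline.error_type,
      platform: baseline.platform
    )
end
```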
Alert Payload
{
"alert_type": "baseline_violation",
"error_type": "NoMethodError",
"platform": "iOS",
"current_count": 35,
"baseline_mean": 15,
"baseline_std_dev": 5,
"std_devs_above": 4.0,
"severity": "critical",
"time_period": "2-3 PM",
"baseline_type": "hourly",
"trend": "increasing",
"dashboard_url": "https://yourapp.com/errors/123"
}
Alert Channels
Baseline alerts use the same notification channels as error notifications:
- Slack: Rich message with charts and links
- Email: HTML email with details and dashboard link
- Discord: Webhook notification
- PagerDuty: Incident creation for critical alerts
- Custom Webhook: POST JSON to your endpoint
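If you point the custom webhook at your own application, a minimal receiving endpoint might look like the sketch below (the route, controller name, and downstream handling are yours to define):

```ruby
# app/controllers/baseline_alerts_controller.rb (hypothetical webhook receiver)
class BaselineAlertsController < ApplicationController
  skip_before_action :verify_authenticity_token, raise: false # webhook POSTs carry no CSRF token

  def create
    payload = JSON.parse(request.body.read)

    if payload["alert_type"] == "baseline_violation" && payload["severity"] == "critical"
      # Hand off to whatever escalation path your team uses
      Rails.logger.warn(
        "[baseline] #{payload['error_type']} on #{payload['platform']}: " \
        "#{payload['current_count']} vs baseline #{payload['baseline_mean']}"
      )
    end

    head :ok
  end
end
```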
Cooldown Mechanism
To prevent alert fatigue, alerts are throttled:
# Check if alert should be sent
RailsErrorDashboard::Services::BaselineAlertThrottler.should_alert?(
error_type: "NoMethodError",
platform: "iOS"
)
# => false if alert sent within cooldown period
Implementation:
- Last alert time stored in Redis (if available) or database
- Key: `baseline_alert:#{error_type}:#{platform}`
- Expires after the cooldown period
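When Redis is available, the cooldown check can be a single atomic SET. A sketch under that assumption (not the gem's exact implementation):

```ruby
# Assumes a configured redis-rb client available as `redis` and the default 120-minute cooldown
COOLDOWN_SECONDS = 120 * 60

def should_alert?(error_type:, platform:)
  key = "baseline_alert:#{error_type}:#{platform}"

  # SET with NX + EX succeeds only when the key does not exist yet, so it checks
  # the cooldown and starts a new one in a single atomic operation.
  redis.set(key, Time.now.to_i, nx: true, ex: COOLDOWN_SECONDS)
end
```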
Manual Alert Testing
Test your alert configuration:
# Send a test baseline alert
RailsErrorDashboard::BaselineAlertJob.perform_now(
error_type: "TestError",
platform: "iOS",
current_count: 100,
baseline_mean: 20,
baseline_std_dev: 10,
severity: :critical
)
Configuration
Full Configuration Reference
RailsErrorDashboard.configure do |config|
# === Baseline Calculation ===
# How far back to look for baseline calculation
config.baseline_lookback_weeks = 4 # Hourly baselines
config.baseline_lookback_weeks_daily = 12 # Daily baselines
config.baseline_lookback_weeks_weekly = 52 # Weekly baselines
# Minimum sample size for valid baseline
config.baseline_min_sample_size = 10
# === Anomaly Detection ===
# Outlier removal threshold (std devs)
config.baseline_outlier_threshold = 3.0
# Anomaly severity thresholds (std devs above mean)
config.baseline_elevated_threshold = 2.0
config.baseline_high_threshold = 3.0
config.baseline_critical_threshold = 4.0
# === Alerts ===
# Enable/disable baseline alerting
config.enable_baseline_alerts = true
# Alert threshold (std devs)
config.baseline_alert_threshold_std_devs = 2.0
# Alert severities
config.baseline_alert_severities = [:critical, :high]
# Cooldown period (minutes)
config.baseline_alert_cooldown_minutes = 120
# Alert channels (see notification configuration)
config.enable_slack_notifications = true
config.slack_webhook_url = ENV['SLACK_WEBHOOK_URL']
end
Tuning Recommendations
For Low-Traffic Applications
config.baseline_alert_threshold_std_devs = 3.0 # More lenient
config.baseline_alert_cooldown_minutes = 60 # Shorter cooldown
config.baseline_min_sample_size = 5 # Lower minimum
For High-Traffic Applications
config.baseline_alert_threshold_std_devs = 2.0 # Stricter
config.baseline_alert_cooldown_minutes = 180 # Longer cooldown
config.baseline_min_sample_size = 20 # Higher minimum
For Noisy Error Types
# Option 1: Exclude from alerting
config.baseline_alert_severities = [:critical] # Only critical
# Option 2: Use higher threshold for specific types
# (Custom logic in BaselineAlertJob)
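Option 2 is not a built-in setting; a hypothetical sketch of such custom logic (the constant, error type names, and values are examples only):

```ruby
# Hypothetical per-error-type overrides checked before the default threshold applies
NOISY_TYPE_THRESHOLDS = {
  "Net::ReadTimeout"         => 4.0, # flaky upstream: only alert on large spikes
  "ActiveRecord::Deadlocked" => 3.5
}.freeze

def effective_threshold(error_type, default_threshold)
  NOISY_TYPE_THRESHOLDS.fetch(error_type, default_threshold)
end

effective_threshold("Net::ReadTimeout", 2.0) # => 4.0
effective_threshold("NoMethodError", 2.0)    # => 2.0
```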
Best Practices
1. Start with Conservative Settings
Begin with stricter thresholds and relax them as needed:
config.baseline_alert_threshold_std_devs = 3.0 # Start strict
config.baseline_alert_severities = [:critical] # Only critical
config.baseline_alert_cooldown_minutes = 180 # Longer cooldown
Gradually tune based on your team’s needs.
2. Monitor Baseline Health
Check baselines weekly:
stats = RailsErrorDashboard::Queries::BaselineStats.new
# Check which error types have baselines
baseline_coverage = stats.baseline_coverage
# => { "NoMethodError" => 80%, "ArgumentError" => 60% }
# Identify stale baselines (not updated recently)
stale_baselines = ErrorBaseline.where("updated_at < ?", 7.days.ago)
3. Handle Seasonal Changes
Baselines adapt over time, but sudden changes (e.g., holiday traffic) may cause false alerts:
Solution: Temporarily adjust thresholds:
# During known high-traffic events
config.baseline_alert_threshold_std_devs = 4.0
Or disable alerts for specific periods:
# In config/initializers/rails_error_dashboard.rb
config.enable_baseline_alerts = ENV['BASELINE_ALERTS_ENABLED'] != 'false'
# Then in production:
export BASELINE_ALERTS_ENABLED=false # Temporarily disable
4. Combine with Other Monitoring
Baseline alerts complement (not replace) other monitoring:
- Application Performance Monitoring (APM): Datadog, New Relic, etc.
- Uptime Monitoring: Pingdom, StatusCake, etc.
- Business Metrics: Revenue, conversions, etc.
Use baseline alerts as early warning signals before issues impact users.
5. Review Alert History
Periodically review alerts to tune configuration:
# Get alert history from your logging system or wherever you record sent alerts
# (AlertLogEntry below is a placeholder for your own model)
recent_alerts = AlertLogEntry.where("created_at > ?", 30.days.ago)
# Analyze:
# - Are there frequent false positives? (increase threshold)
# - Are we missing real issues? (decrease threshold)
# - Is cooldown too short? (team fatigued by repeated alerts)
6. Use Baselines for Capacity Planning
Baselines reveal traffic patterns:
stats = RailsErrorDashboard::Queries::BaselineStats.new
hourly_baselines = stats.hourly_baseline("NoMethodError", "iOS")
# Find peak hours (hours whose baseline mean exceeds the average across all hours)
overall_mean = hourly_baselines.sum { |_hour, data| data[:mean] } / hourly_baselines.size.to_f
peak_hours = hourly_baselines
  .select { |hour, data| data[:mean] > overall_mean }
  .keys
# => [9, 10, 11, 14, 15, 16] # Business hours
Use this to:
- Scale infrastructure during peak hours
- Schedule deployments during low-traffic periods
- Plan maintenance windows
Troubleshooting
“No baseline available for error type”
Cause: Not enough historical data to calculate baseline.
Requirements:
- At least `baseline_min_sample_size` data points (default: 10)
- Data must span the lookback period (e.g., 4 weeks for hourly)
Solution:
# Check sample size
ErrorLog.where(error_type: "YourError").count
# If < 10, wait for more data
# Lower minimum if needed (not recommended)
config.baseline_min_sample_size = 5
“Baseline alerts not sending”
Checklist:
- ✅ Alerts enabled: `config.enable_baseline_alerts = true`
- ✅ Notification channel configured: e.g. `config.enable_slack_notifications = true`
- ✅ Severity matches: `config.baseline_alert_severities` includes the current severity
- ✅ Cooldown expired: check the last alert time
- ✅ Threshold exceeded: Current count > mean + (threshold * std_dev)
Debug:
# Check baseline exists
baseline = ErrorBaseline.find_by(error_type: "YourError", platform: "iOS")
baseline.present? # Should be true
# Check cooldown
throttler = RailsErrorDashboard::Services::BaselineAlertThrottler
throttler.should_alert?(error_type: "YourError", platform: "iOS")
# => true if alert should be sent
# Manually trigger alert
RailsErrorDashboard::BaselineAlertJob.perform_now(...)
“Too many false positive alerts”
Causes:
- Threshold too sensitive
- High natural variance in error rates
- Insufficient historical data
Solutions:
- Increase the threshold: `config.baseline_alert_threshold_std_devs = 3.0` (was 2.0)
- Alert only on critical: `config.baseline_alert_severities = [:critical]` (was `[:critical, :high]`)
- Increase the cooldown: `config.baseline_alert_cooldown_minutes = 240` (was 120; 4 hours)
- Exclude noisy error types (custom modification in BaselineAlertJob): `return if error_type.in?(["CommonWarning", "ExpectedError"])`
“Alerts missing real issues”
Causes:
- Threshold too lenient
- Gradual increases not detected (boiling frog problem)
- Baseline not updated recently
Solutions:
- Decrease the threshold: `config.baseline_alert_threshold_std_devs = 1.5` (was 2.0)
- Alert on more severities: `config.baseline_alert_severities = [:critical, :high, :elevated]`
- Force a baseline recalculation: `RailsErrorDashboard::Services::BaselineCalculator.calculate_all_baselines(force: true)`
- Monitor trends manually: `stats = RailsErrorDashboard::Queries::BaselineStats.new`, then `trends = stats.error_trends(days: 30)` and look for gradual increases
“Baselines seem inaccurate”
Causes:
- Recent code changes altered normal behavior
- Seasonal patterns not yet learned
- Outliers not properly filtered
Solutions:
- Reset baselines after major changes:

```ruby
# Delete old baselines for affected error types
ErrorBaseline.where(error_type: "AffectedError").delete_all

# Recalculation will use only recent data
RailsErrorDashboard::Services::BaselineCalculator.calculate_all_baselines
```

- Adjust the outlier threshold:

```ruby
config.baseline_outlier_threshold = 2.5 # Was 3.0 (more aggressive filtering)
```

- Use a shorter lookback for new features:

```ruby
# Temporarily use a shorter period
config.baseline_lookback_weeks = 2 # Was 4
```
Database Schema
ErrorBaseline Table
create_table :rails_error_dashboard_error_baselines do |t|
t.string :error_type, null: false
t.string :platform
t.string :baseline_type, null: false # "hourly", "daily", "weekly"
t.datetime :period_start, null: false
t.datetime :period_end, null: false
# Statistical measures
t.float :mean
t.float :std_dev
t.float :percentile_95
t.float :percentile_99
t.integer :sample_size
t.timestamps
end
add_index :rails_error_dashboard_error_baselines,
[:error_type, :platform, :baseline_type, :period_start],
name: "index_error_baselines_on_type_platform_baseline_period"
API Reference
BaselineStats Query Object
stats = RailsErrorDashboard::Queries::BaselineStats.new
# Get baseline for specific error type
baseline = stats.hourly_baseline("NoMethodError", "iOS")
# => { mean: 15, std_dev: 5, percentile_95: 23, ... }
# Get current anomalies
anomalies = stats.current_anomalies(severity: [:high, :critical])
# => [{ error_type: "...", severity: :critical, std_devs_above: 4.2 }]
# Check if current count is anomalous
is_anomaly = stats.is_anomaly?(
error_type: "NoMethodError",
platform: "iOS",
current_count: 35
)
# => { anomaly: true, severity: :critical, std_devs_above: 4.0 }
BaselineCalculator Service
# Calculate all baselines
RailsErrorDashboard::Services::BaselineCalculator.calculate_all_baselines
# Calculate for specific error type
RailsErrorDashboard::Services::BaselineCalculator.calculate_baseline(
error_type: "NoMethodError",
platform: "iOS",
baseline_type: "hourly"
)
BaselineAlertThrottler Service
throttler = RailsErrorDashboard::Services::BaselineAlertThrottler
# Check if alert should be sent
should_send = throttler.should_alert?(
error_type: "NoMethodError",
platform: "iOS"
)
# => true or false
# Record that alert was sent
throttler.record_alert(
error_type: "NoMethodError",
platform: "iOS"
)
Further Reading
- Advanced Error Grouping Guide - Fuzzy matching and cascades
- Platform Comparison Guide - iOS vs Android analysis
- Occurrence Patterns Guide - Cyclical and burst patterns
- Error Correlation Guide - Release and user correlation