Rails Error Dashboard

Self-hosted error tracking and exception monitoring for Ruby on Rails. Open source, free forever. A Sentry alternative with cause chains, analytics, workflow management, and multi-channel notifications.

Download as .zip Download as .tar.gz View on GitHub

Multi-App Performance Monitoring

This guide covers performance monitoring and optimization for Rails Error Dashboard’s multi-app support feature.

Architecture Overview

Multi-app support is designed for high-concurrency scenarios with multiple Rails applications writing errors simultaneously to a shared database.

Key Design Decisions

  1. Row-Level Locking: Uses pessimistic locking scoped to (application_id, error_hash) - apps never block each other
  2. Cached Application Lookups: Application names cached for 1 hour to reduce database hits
  3. Composite Indexes: Optimized indexes on [application_id, occurred_at] and [application_id, resolved]
  4. Per-App Deduplication: Error hashes include application_id to track same errors independently across apps
  5. Consistent Lock Ordering: Prevents deadlocks by always locking in (application_id ASC, error_hash ASC) order

Database Performance

Index Usage Monitoring (PostgreSQL)

Check if multi-app indexes are being used effectively:

-- Index usage statistics
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan as scans,
  idx_tup_read as tuples_read,
  idx_tup_fetch as tuples_fetched
FROM pg_stat_user_indexes
WHERE tablename = 'rails_error_dashboard_error_logs'
  AND indexname LIKE '%application%'
ORDER BY idx_scan DESC;

Expected Results:

If scan counts are low (<100), the indexes may not be used. Check:

  1. Are you filtering by application_id?
  2. Is the query planner choosing a different strategy? (Run EXPLAIN ANALYZE)

Slow Query Detection

Find slow queries involving applications:

-- Requires pg_stat_statements extension
SELECT
  query,
  calls,
  total_exec_time / calls as avg_time_ms,
  min_exec_time as min_ms,
  max_exec_time as max_ms,
  stddev_exec_time as stddev_ms
FROM pg_stat_statements
WHERE query LIKE '%rails_error_dashboard_error_logs%'
  AND query LIKE '%application_id%'
ORDER BY avg_time_ms DESC
LIMIT 20;

Optimization Targets:

Cache Hit Rate

Monitor application lookup cache performance:

-- PostgreSQL cache hit ratio
SELECT
  sum(heap_blks_read) as heap_read,
  sum(heap_blks_hit) as heap_hit,
  sum(heap_blks_hit) / NULLIF((sum(heap_blks_hit) + sum(heap_blks_read)), 0) as cache_ratio
FROM pg_statio_user_tables
WHERE relname = 'rails_error_dashboard_applications';

Target: Cache ratio > 0.99 (99%+)

If lower:

Lock Monitoring

Check for lock contention between applications:

-- Active locks on error_logs table
SELECT
  l.pid,
  l.mode,
  l.granted,
  a.application_name,
  a.query_start,
  a.state,
  substring(a.query, 1, 100) as query
FROM pg_locks l
JOIN pg_stat_activity a ON l.pid = a.pid
WHERE l.relation = 'rails_error_dashboard_error_logs'::regclass
ORDER BY l.granted, a.query_start;

Expected Behavior:

If you see ungranted locks:

  1. Check for long-running transactions
  2. Verify row-level locking is working (check find_or_increment_by_hash uses .lock)
  3. Look for deadlocks in PostgreSQL logs

Deadlock Detection

-- Check PostgreSQL logs for deadlocks
-- In postgresql.conf: log_lock_waits = on, deadlock_timeout = 1s

-- Or query recent deadlocks (requires logging_collector = on)
SELECT * FROM pg_stat_database_conflicts
WHERE datname = current_database();

Target: Zero deadlocks

Our design prevents deadlocks by:

Rails Application Monitoring

Cache Performance

Monitor Rails cache for application lookups:

# In Rails console
stats = Rails.cache.stats

# Check hit rate
cache_key_pattern = "error_dashboard/application/*"

# Clear cache and test
Rails.cache.clear
app1 = RailsErrorDashboard::Application.find_or_create_by_name("TestApp")
app2 = RailsErrorDashboard::Application.find_or_create_by_name("TestApp") # Should hit cache

# Verify cache
cached = Rails.cache.read("error_dashboard/application/TestApp")
puts "Cached: #{cached.inspect}"

Expected:

Query Object Performance

Measure analytics query performance by application:

# Benchmark dashboard stats
require 'benchmark'

apps = RailsErrorDashboard::Application.pluck(:id).sample(5)

Benchmark.bm(20) do |x|
  x.report("All apps:") do
    RailsErrorDashboard::Queries::DashboardStats.call
  end

  apps.each do |app_id|
    x.report("App #{app_id}:") do
      RailsErrorDashboard::Queries::DashboardStats.call(application_id: app_id)
    end
  end
end

Targets:

Error Write Performance

Benchmark error logging with multiple apps:

require 'benchmark'

apps = 5.times.map { |i| RailsErrorDashboard::Application.create!(name: "BenchApp#{i}") }

Benchmark.bm(20) do |x|
  x.report("Sequential writes:") do
    100.times do
      app = apps.sample
      RailsErrorDashboard::ErrorLog.find_or_increment_by_hash(
        "test_#{rand(10)}",
        application_id: app.id,
        error_type: "TestError",
        message: "Benchmark test",
        occurred_at: Time.current
      )
    end
  end

  x.report("Concurrent writes:") do
    threads = 10.times.map do
      Thread.new do
        10.times do
          app = apps.sample
          RailsErrorDashboard::ErrorLog.find_or_increment_by_hash(
            "concurrent_#{rand(10)}",
            application_id: app.id,
            error_type: "ConcurrentError",
            message: "Thread test",
            occurred_at: Time.current
          )
        end
      end
    end
    threads.each(&:join)
  end
end

Targets:

Production Monitoring

Metrics to Track

  1. Error Write Latency
    • P50, P95, P99 for LogError.call
    • Target: P95 < 50ms, P99 < 200ms
  2. Application Cache Hit Rate
    • Percentage of find_or_create_by_name that hit cache
    • Target: > 95%
  3. Query Performance
    • Dashboard stats query time
    • Analytics query time
    • Filtering query time
    • Target: All < 100ms P95
  4. Database Metrics
    • Lock wait time
    • Deadlock count (should be 0)
    • Index scan ratio (vs sequential scans)

New Relic / APM Integration

# In config/initializers/rails_error_dashboard.rb

# Track error logging performance
ActiveSupport::Notifications.subscribe("log_error.rails_error_dashboard") do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)

  NewRelic::Agent.record_metric(
    "Custom/ErrorDashboard/LogError",
    event.duration
  )

  NewRelic::Agent.record_metric(
    "Custom/ErrorDashboard/Application/#{event.payload[:application_name]}",
    event.duration
  )
end

# Instrument find_or_create_by_name
RailsErrorDashboard::Application.class_eval do
  def self.find_or_create_by_name_with_instrumentation(name)
    ActiveSupport::Notifications.instrument("application.find_or_create", application: name) do
      find_or_create_by_name_without_instrumentation(name)
    end
  end

  class << self
    alias_method :find_or_create_by_name_without_instrumentation, :find_or_create_by_name
    alias_method :find_or_create_by_name, :find_or_create_by_name_with_instrumentation
  end
end

Alerting Thresholds

Set up alerts for:

  1. High Error Write Latency
    • Alert if P95 > 200ms for 5 minutes
    • Action: Check database load, slow queries
  2. Cache Miss Rate
    • Alert if cache hit rate < 90% for 10 minutes
    • Action: Check Rails cache backend, memory
  3. Dashboard Query Slowness
    • Alert if dashboard stats > 500ms
    • Action: Check error log count, consider archiving old errors
  4. Lock Contention
    • Alert if lock wait events > 10/minute
    • Action: Check for long transactions, review locking strategy

Performance Checklist

Use this checklist to verify multi-app performance:

Database Layer

Rails Application Layer

Monitoring

Troubleshooting

Problem: Slow Error Writes (>200ms)

Diagnosis:

-- Check if indexes are used
EXPLAIN ANALYZE
SELECT * FROM rails_error_dashboard_error_logs
WHERE application_id = 1 AND error_hash = 'abc123';

Solutions:

  1. Verify composite index exists: CREATE INDEX index_error_logs_on_app_hash ON rails_error_dashboard_error_logs(application_id, error_hash);
  2. Check database load and connection pool
  3. Enable async logging in config

Problem: High Cache Miss Rate

Diagnosis:

# Check if cache backend is working
Rails.cache.write("test", "value")
Rails.cache.read("test") # Should return "value"

Solutions:

  1. Check Rails cache backend (config/environments/production.rb)
  2. Verify cache size limits aren’t being hit
  3. Check if cache is being cleared too frequently

Problem: Deadlocks

Diagnosis:

# Check PostgreSQL logs
grep "deadlock detected" /var/log/postgresql/postgresql-*.log

Solutions:

  1. Verify find_or_increment_by_hash uses .lock
  2. Check for custom queries that bypass row-level locking
  3. Review transaction isolation level (should be READ COMMITTED)

Problem: Ungranted Locks

Diagnosis:

SELECT * FROM pg_locks WHERE NOT granted;

Solutions:

  1. Check for long-running transactions blocking writes
  2. Verify no table-level locks (DDL operations during writes)
  3. Consider increasing max_locks_per_transaction if hitting limit

Benchmarks

These are reference benchmarks from testing multi-app support:

Write Performance

Scenario Throughput P95 Latency P99 Latency
1 app, sequential 200 writes/sec 5ms 10ms
5 apps, sequential 195 writes/sec 6ms 12ms
5 apps, 10 threads 950 writes/sec 25ms 45ms
5 apps, 50 threads 2800 writes/sec 85ms 150ms

Test environment: PostgreSQL 14, 4 CPU, 8GB RAM, local network

Cache Performance

Operation Without Cache With Cache Speedup
Application lookup 2.5ms 0.05ms 50x
1000 lookups 2500ms 50ms 50x

Query Performance

Query All Apps Single App Reduction
Dashboard stats 45ms 12ms 73%
Analytics (7 days) 180ms 55ms 69%
Error list (paginated) 25ms 8ms 68%

Based on database with 100,000 errors across 5 applications

Optimization Tips

  1. Use Async Logging: Enable config.async_logging = true to move error writes out of request cycle

  2. Archive Old Errors: Use error_dashboard:cleanup_resolved rake task to remove old resolved errors

  3. Separate Database: Consider dedicated database for error dashboard to isolate performance impact

  4. Connection Pooling: Increase connection pool size if seeing “could not obtain connection” errors

  5. Read Replicas: Route dashboard queries to read replicas to reduce load on primary

  6. Sampling: Enable error sampling for high-frequency errors to reduce write volume

Contact

For performance issues or optimization questions:

Include: