fix(redis): prevent false rate limits and code execution failures during redis outages #3289

Merged
waleedlatif1 merged 1 commit into staging from fix/rate-limits on Feb 21, 2026
@waleedlatif1
Collaborator

Summary

  • Add 30s PING health check that forces reconnect after 3 consecutive failures — detects stale TCP connections that ioredis can't see
  • Rate limiter fails open on Redis errors instead of blocking paying customers with false 429s
  • IVM distributed lease falls back to local execution when Redis is unavailable instead of returning "temporarily unavailable"
  • Storage factory falls back to PostgreSQL when Redis client is transiently unavailable instead of throwing
  • Replace linear retry backoff with exponential backoff + jitter to prevent thundering herd on reconnect
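
For illustration, the exponential backoff with jitter from the last bullet might look roughly like the following. This is a minimal sketch, not the code from this PR: the name `retryDelayMs`, the base delay, the cap, and the 30% jitter factor are all assumptions. The function has the shape that ioredis's `retryStrategy` option accepts (attempt count in, delay in milliseconds out).

```typescript
// Assumed constants; the PR description does not state the actual values.
const BASE_DELAY_MS = 1_000
const MAX_DELAY_MS = 30_000
const JITTER_FACTOR = 0.3

/**
 * Delay before reconnect attempt `attempt` (1-based).
 * `random` is injectable so the calculation can be tested deterministically.
 */
function retryDelayMs(attempt: number, random: () => number = Math.random): number {
  // Exponential growth: base * 2^(attempt - 1), capped at MAX_DELAY_MS.
  const exponential = Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS)
  // Random jitter spreads reconnects out so clients do not retry in lockstep.
  const jitter = exponential * JITTER_FACTOR * random()
  return Math.round(exponential + jitter)
}
```

With ioredis this could be wired in as `new Redis(url, { retryStrategy: (times) => retryDelayMs(times) })`. The jitter is what addresses the thundering herd: without it, every instance that lost its connection at the same moment would also retry at the same moments.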

Context

On Feb 21, a stale TCP connection between ECS and Redis Cloud caused all Redis commands to time out. Because the rate limiter failed closed and the IVM lease returned errors on Redis failures, this silently blocked all non-manual workflow executions — producing 329 false "Rate limit exceeded" errors and 248 "Code execution is temporarily unavailable" errors across 15+ workspaces. An ECS restart fixed the immediate issue. These changes prevent recurrence.
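
To make the fail-closed versus fail-open distinction concrete, here is a minimal sketch of a fail-open check. The `TokenBucketStorage` interface and the `checkRateLimit` function are hypothetical names chosen for this example, not the actual rate limiter API in this repo.

```typescript
interface RateLimitResult {
  allowed: boolean
  remaining: number
}

// Hypothetical storage interface; the real adapter API may differ.
interface TokenBucketStorage {
  consume(key: string, tokens: number): Promise<RateLimitResult>
}

async function checkRateLimit(
  storage: TokenBucketStorage,
  key: string
): Promise<RateLimitResult> {
  try {
    // Normal path: the storage backend decides.
    return await storage.consume(key, 1)
  } catch (error) {
    // Fail open: a storage outage is not evidence that the caller exceeded
    // its limit, so allow the request instead of returning a false 429.
    console.warn('Rate limit storage error, allowing request', error)
    return { allowed: true, remaining: 0 }
  }
}
```

A fail-closed version would return `allowed: false` in the `catch` branch, which is the behavior that turned the Redis timeouts into "Rate limit exceeded" errors.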

Type of Change

  • Bug fix

Testing

  • 23 new/updated tests across 3 suites (redis config, storage factory, isolated-vm)
  • Verified TypeScript compiles clean

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel bot commented Feb 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | Feb 21, 2026 8:05pm |

@waleedlatif1 waleedlatif1 changed the title from "fix(redis): prevent false rate limits and code execution failures during Redis outages" to "fix(redis): prevent false rate limits and code execution failures during redis outages" on Feb 21, 2026
@greptile-apps
Contributor

greptile-apps bot commented Feb 21, 2026

Greptile Summary

This PR implements critical resilience improvements to prevent false rate limits and code execution failures during Redis outages. The changes shift from fail-closed to fail-open behavior across three key systems:

Key Changes:

  • PING health check: 30s interval detects stale TCP connections that ioredis can't see, forcing reconnect after 3 consecutive failures
  • Rate limiter: Now fails open (allows requests) instead of returning false 429s when Redis is unavailable
  • Storage factory: Falls back to PostgreSQL when Redis client unavailable, with reconnect listener to invalidate cached adapters
  • IVM distributed lease: Falls back to local execution instead of blocking with "temporarily unavailable" errors
  • Retry strategy: Replaced linear backoff with exponential backoff + jitter to prevent thundering herd
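
As a rough illustration of the PING health check in the first bullet, the sketch below counts consecutive failures and forces a reconnect on the third. The 30s interval and the threshold of 3 come from the PR description. Everything else is an assumption: `createHealthCheck`, `PingableClient`, `withTimeout`, and the 5s ping timeout are hypothetical, and the real code in `redis.ts` may be structured differently.

```typescript
// Minimal client surface used here; ioredis clients expose both methods.
interface PingableClient {
  ping(): Promise<unknown>
  disconnect(reconnect: boolean): void
}

const HEALTH_CHECK_INTERVAL_MS = 30_000 // from the PR description
const MAX_CONSECUTIVE_FAILURES = 3 // from the PR description
const PING_TIMEOUT_MS = 5_000 // assumed value, not stated in the PR

// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms)
    promise.then(
      (value) => {
        clearTimeout(timer)
        resolve(value)
      },
      (error) => {
        clearTimeout(timer)
        reject(error)
      }
    )
  })
}

function createHealthCheck(client: PingableClient, notifyReconnectListeners: () => void) {
  let consecutiveFailures = 0

  // One health check round: a successful PING resets the failure counter.
  async function check(): Promise<void> {
    try {
      await withTimeout(client.ping(), PING_TIMEOUT_MS)
      consecutiveFailures = 0
    } catch {
      consecutiveFailures += 1
      if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
        consecutiveFailures = 0
        // disconnect(true) drops the socket and lets the client reconnect.
        client.disconnect(true)
        notifyReconnectListeners()
      }
    }
  }

  // Runs `check` on a fixed interval; returns a function that stops it.
  function start(): () => void {
    const timer = setInterval(() => void check(), HEALTH_CHECK_INTERVAL_MS)
    return () => clearInterval(timer)
  }

  return { check, start }
}
```

The point of an application-level PING is that a stale TCP connection can look open to the client library while every command times out, so only a failed round trip reveals the problem.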

Testing:
Comprehensive test coverage with 23 new/updated tests verifying health check behavior, reconnect listeners, fallback logic, and retry strategy calculations.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-tested with 23 new/updated tests, follows established patterns, and addresses a real production incident with appropriate fail-open strategies that preserve availability while maintaining local enforcement of concurrency limits
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/core/config/redis.ts | Added 30s PING health check with reconnect listener system and exponential backoff retry strategy |
| apps/sim/lib/core/rate-limiter/rate-limiter.ts | Changed from fail-closed to fail-open on storage errors to prevent false 429s during outages |
| apps/sim/lib/core/rate-limiter/storage/factory.ts | Added PostgreSQL fallback when Redis configured but unavailable, with reconnect listener for cache invalidation |
| apps/sim/lib/execution/isolated-vm.ts | Changed distributed lease to fall back to local execution instead of returning error when Redis unavailable |
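
The storage factory behavior summarized above (PostgreSQL fallback plus cache invalidation on reconnect) could be sketched like this. All names here are hypothetical stand-ins, including `createStorageFactory`, `StorageAdapter`, and the callback parameters. Whether the real factory caches the fallback adapter is not stated in the PR, so this sketch assumes it does not.

```typescript
// Hypothetical adapter shape; `kind` only exists to make the sketch observable.
interface StorageAdapter {
  kind: 'redis' | 'db'
}

function createStorageFactory(
  getRedisClient: () => object | null,
  onRedisReconnect: (listener: () => void) => void
): () => StorageAdapter {
  let cachedAdapter: StorageAdapter | null = null

  // When the Redis connection is forcibly reset, drop the cached adapter so
  // the next call builds one against the fresh connection.
  onRedisReconnect(() => {
    cachedAdapter = null
  })

  return function getStorageAdapter(): StorageAdapter {
    if (cachedAdapter !== null) return cachedAdapter

    const client = getRedisClient()
    if (client === null) {
      // Redis is transiently unavailable: fall back to the database-backed
      // token bucket instead of throwing. Assumption: the fallback is not
      // cached, so Redis is tried again on the next call.
      return { kind: 'db' }
    }

    cachedAdapter = { kind: 'redis' }
    return cachedAdapter
  }
}
```

Not caching the fallback keeps a short Redis blip from pinning the process to PostgreSQL for its whole lifetime, at the cost of re-checking the client on each call while Redis is down.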

Sequence Diagram

sequenceDiagram
    participant App as Application
    participant Redis as Redis Health Check
    participant RateLimit as Rate Limiter
    participant Storage as Storage Factory
    participant IVM as IVM Distributed Lease
    participant PG as PostgreSQL

    Note over Redis: Every 30s PING check
    Redis->>Redis: PING fails (3 consecutive)
    Redis->>Redis: Force disconnect(true)
    Redis->>Storage: Notify reconnect listeners
    Storage->>Storage: Clear cached adapter

    App->>RateLimit: Check rate limit
    RateLimit->>Storage: Get storage adapter
    alt Redis unavailable
        Storage->>PG: Fall back to DbTokenBucket
        Storage-->>RateLimit: Return DB adapter
    else Redis available
        Storage-->>RateLimit: Return Redis adapter
    end
    
    alt Storage error during check
        RateLimit-->>App: Fail open (allow=true)
    else Storage success
        RateLimit-->>App: Return actual limit result
    end

    App->>IVM: Execute code
    IVM->>IVM: Try acquire distributed lease
    alt Redis unavailable
        IVM->>IVM: Fall back to local execution
        IVM-->>App: Execute with local pool limits
    else Redis error
        IVM->>IVM: Fall back to local execution
        IVM-->>App: Execute with local pool limits
    else Lease acquired
        IVM-->>App: Execute normally
    end

Last reviewed commit: 6e66168
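
The IVM branch of the diagram can be read as a small decision function. The sketch below is an interpretation of the diagram, not the code in `isolated-vm.ts`: `decideExecution`, `LeaseDecision`, and the `acquireLease` callback are hypothetical, and the "lease denied" branch is an assumption, since the diagram only covers the unavailable, error, and acquired cases.

```typescript
type LeaseDecision =
  | { proceed: true; mode: 'distributed' | 'local' }
  | { proceed: false; reason: string }

/**
 * `acquireLease` is null when no Redis client is available. Otherwise it
 * resolves to true if the distributed lease was granted and rejects on
 * Redis errors such as command timeouts.
 */
async function decideExecution(
  acquireLease: (() => Promise<boolean>) | null
): Promise<LeaseDecision> {
  if (acquireLease === null) {
    // Redis unavailable: run under local pool limits only.
    return { proceed: true, mode: 'local' }
  }
  try {
    const granted = await acquireLease()
    if (granted) return { proceed: true, mode: 'distributed' }
    // Assumed branch: Redis answered and the distributed limit is reached.
    return { proceed: false, reason: 'distributed lease limit reached' }
  } catch {
    // Redis error: fall back to local execution instead of failing the run.
    return { proceed: true, mode: 'local' }
  }
}
```

In both fallback cases the local pool limits still apply, which is what the review summary means by maintaining local enforcement of concurrency limits.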

Contributor

@greptile-apps greptile-apps bot left a comment

8 files reviewed, no comments

@waleedlatif1
Collaborator Author

@cursor review

@waleedlatif1
Collaborator Author

@greptile

Contributor

@greptile-apps greptile-apps bot left a comment

8 files reviewed, no comments

@cursor cursor bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

@waleedlatif1 waleedlatif1 merged commit ccb4f59 into staging Feb 21, 2026
12 checks passed
@waleedlatif1 waleedlatif1 deleted the fix/rate-limits branch February 21, 2026 20:20