Building a Resilience Engine in Python: Internals of LimitPal (Part 2)

  • MyrinNew
    Senior Member
    • Feb 2024
    • 5175


    How the executor pipeline, clock abstraction, and circuit breaker architecture actually work.


    If you haven’t read Part 1, the short version:


    Resilience shouldn’t be a pile of decorators.

    It should be a system.


    Part 1 explained the motivation.


    This post is about how the system is built.



    The core design constraint

    I started with one rule:


    Every resilience feature must compose cleanly with others.


    Most libraries solve a single concern well.


    But composition is where systems break.


    Retry + rate limiting + circuit breaker is not additive.

    It’s architectural.


    So LimitPal is built around one idea:


    👉 A single execution pipeline


    Everything plugs into it.



    The executor pipeline

    Every call flows through the same stages:






    Circuit breaker → Rate limiter → Retry loop → Result recording







    Not an arbitrary order.


    This ordering is deliberate.


    Step 1: Circuit breaker first

    Fail fast.


    If the upstream service is already down,

    don’t waste tokens,

    don’t trigger retries,

    don’t create load.


    This protects your own system.

    Step 2: Rate limiter

    Only after we know execution is allowed

    do we consume capacity.


    This ensures:
    • breaker failures don’t eat quota
    • retries still respect rate limits
    • burst behavior stays predictable

    Step 3: Retry loop

    Retry lives inside the limiter window.


    Not outside.


    This is important.


    If retry lived outside,

    one logical call could consume an unbounded amount of capacity.


    Inside the window:


    A call is a budgeted operation.


    That constraint keeps systems stable under stress.

    Step 4: Result recording

    Success/failure feedback feeds the breaker.


    This closes the loop.


    The executor isn’t just running code —

    it’s adapting to system health.
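
To make the ordering concrete, here is a minimal sketch of the four stages. It is not LimitPal's actual code: `SimpleBreaker`, `SimpleLimiter`, and `run_pipeline` are hypothetical names made up for illustration. The sketch assumes the limiter charges one token per logical call and that the retry loop is bounded by `max_attempts`, which is one reading of "retry lives inside the limiter window". Backoff between attempts is left out to keep it short.

```python
class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without running it."""


class RateLimitedError(Exception):
    """Raised when the limiter has no capacity left."""


class SimpleBreaker:
    """Hypothetical breaker: rejects after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    def allow(self) -> bool:
        return self.failures < self.threshold

    def record(self, ok: bool) -> None:
        self.failures = 0 if ok else self.failures + 1


class SimpleLimiter:
    """Hypothetical limiter: a fixed token budget with no refill."""

    def __init__(self, tokens: int) -> None:
        self.tokens = tokens

    def try_acquire(self) -> bool:
        if self.tokens <= 0:
            return False
        self.tokens -= 1
        return True


def run_pipeline(fn, breaker, limiter, max_attempts: int = 3):
    # Stage 1: circuit breaker. Fail fast before spending any capacity.
    if not breaker.allow():
        raise CircuitOpenError("circuit is open")
    # Stage 2: rate limiter. One token per logical call (an assumption).
    if not limiter.try_acquire():
        raise RateLimitedError("no capacity left")
    # Stage 3: bounded retry loop, running inside the acquired budget.
    last_error = None
    for _ in range(max_attempts):
        try:
            result = fn()
        except Exception as exc:
            last_error = exc
            continue
        # Stage 4: feed the outcome back to the breaker.
        breaker.record(True)
        return result
    breaker.record(False)
    raise last_error
```

Note that a rejected call in stage 1 never reaches the limiter, so an open circuit costs no quota.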



    Why decorators fail here

    Decorators look composable.


    They aren’t.


    Each decorator:
    • owns its own time model
    • owns its own retry logic
    • owns its own failure semantics


    Stack them and you get:


    emergent behavior you didn’t design


    The executor forces:
    • a shared clock
    • a shared failure model
    • a shared execution lifecycle


    That’s what makes the system predictable.



    The clock abstraction (the hidden hero)

    Time is the hardest dependency in resilience systems.


    Retries depend on time.

    Rate limiting depends on time.

    Circuit breakers depend on time.


    If every component calls time.time() directly:


    You lose control.


    LimitPal introduces a pluggable clock:






    from typing import Protocol

    class Clock(Protocol):
        def now(self) -> float: ...  # current time in seconds
        def sleep(self, seconds: float) -> None: ...  # blocking wait
        async def sleep_async(self, seconds: float) -> None: ...  # awaitable wait







    Everything uses this.


    Not system time.


    Production clock

    Uses monotonic time:
    • immune to system clock jumps
    • safe under NTP sync
    • stable under container migrations
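
A production clock along these lines could be as small as the sketch below. `MonotonicClock` is a hypothetical name, not necessarily what LimitPal calls it; the only real calls used are `time.monotonic`, `time.sleep`, and `asyncio.sleep` from the standard library.

```python
import asyncio
import time


class MonotonicClock:
    """Sketch of a production clock backed by the monotonic timer."""

    def now(self) -> float:
        # time.monotonic() never goes backwards, even if wall time is adjusted.
        return time.monotonic()

    def sleep(self, seconds: float) -> None:
        time.sleep(seconds)

    async def sleep_async(self, seconds: float) -> None:
        await asyncio.sleep(seconds)
```

The values returned by `now()` are only meaningful as differences between two readings, which is all that retries, limiters, and breakers need.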


    MockClock

    Tests become deterministic:






    clock.advance(5.0)







    No waiting.

    No flakiness.

    No race conditions.


    You can simulate minutes of retry behavior instantly.


    This isn’t a testing trick.


    It’s architectural control over time.
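
As an illustration, a mock clock can be little more than a counter. The sketch below is an assumption about how such a class might look, not LimitPal's implementation; `retry_with_backoff` is a hypothetical helper added only to show the clock in use.

```python
class MockClock:
    """Test clock: time moves only when told to."""

    def __init__(self, start: float = 0.0) -> None:
        self._now = start

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        if seconds < 0:
            raise ValueError("cannot move time backwards")
        self._now += seconds

    def sleep(self, seconds: float) -> None:
        # Sleeping just advances virtual time; nothing blocks.
        self.advance(seconds)

    async def sleep_async(self, seconds: float) -> None:
        self.advance(seconds)


def retry_with_backoff(fn, clock, attempts: int = 4, base: float = 1.0):
    """Hypothetical helper: exponential backoff driven by the injected clock."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            clock.sleep(base * 2 ** attempt)
```

With three failures before a success, the helper "waits" 1 + 2 + 4 seconds of virtual time, and the test finishes immediately.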



    Circuit breaker architecture

    The breaker is a state machine:






    CLOSED → OPEN → HALF_OPEN → CLOSED







    But the tricky part isn’t the states.


    It’s transition discipline.


    CLOSED

    Normal operation.


    Failures increment a counter.

    Success resets it.


    When the threshold is reached → OPEN.

    OPEN

    All calls fail immediately.


    No retry.

    No limiter usage.


    Just fast rejection.


    After recovery timeout → HALF_OPEN.

    HALF_OPEN

    Limited probing phase.


    We allow a small number of calls.


    If they succeed → CLOSED.

    If they fail → back to OPEN.


    This prevents retry storms after recovery.


    The breaker is not just protection.


    It’s a stability regulator.
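
The transitions above can be sketched as a small class. This is an illustrative sketch, not LimitPal's breaker: the parameter names `failure_threshold`, `recovery_timeout`, and `half_open_max_calls` are assumptions, and the sketch closes the circuit on the first successful probe, which may be simpler than the real policy.

```python
import enum
import time


class State(enum.Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 half_open_max_calls=1, now=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self._now = now  # injectable time source, so tests need no waiting
        self.state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probes = 0

    def allow(self) -> bool:
        if self.state is State.OPEN:
            if self._now() - self._opened_at < self.recovery_timeout:
                return False  # fast rejection while the timeout runs
            # Recovery timeout elapsed: start probing.
            self.state = State.HALF_OPEN
            self._probes = 0
        if self.state is State.HALF_OPEN:
            if self._probes >= self.half_open_max_calls:
                return False  # probe budget already handed out
            self._probes += 1
        return True

    def record_success(self) -> None:
        # A success resets the counter; in HALF_OPEN it also closes the circuit.
        self.state = State.CLOSED
        self._failures = 0

    def record_failure(self) -> None:
        if self.state is State.HALF_OPEN:
            self._trip()  # a failed probe sends us straight back to OPEN
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = State.OPEN
        self._opened_at = self._now()
        self._failures = 0
```

Keeping every state change inside `allow`, `record_success`, and `record_failure` is one way to get the transition discipline described above: no caller can move the breaker into a state by hand.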



    Why retry must be jittered

    Exponential backoff without jitter is dangerous.


    If 1,000 clients retry at the same time:


    You get a synchronized spike.


    You kill the service again.


    Jitter spreads retries across time.


    Instead of:






    all retry at t=1s







    You get:






    retry in [0.9s, 1.1s]







    Small randomness → large stability gain.


    This is one of those details that separates toy resilience

    from production resilience.
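
A minimal sketch of jittered exponential backoff, assuming a symmetric ±10% jitter to match the [0.9s, 1.1s] window above. The function name and its parameters (`base`, `cap`, `jitter`) are hypothetical, not LimitPal's API.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  jitter: float = 0.1,
                  rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (zero-based), in seconds."""
    rng = rng or random.Random()
    # Plain exponential backoff, capped so delays do not grow without bound.
    raw = min(cap, base * 2 ** attempt)
    # Scale by a random factor in [1 - jitter, 1 + jitter] to desynchronize clients.
    return raw * rng.uniform(1.0 - jitter, 1.0 + jitter)
```

Passing a seeded `random.Random` as `rng` keeps tests reproducible; in production each client would use its own unseeded generator so that delays differ across clients.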



    Key-based isolation

    Limiters operate per key:






    user:123
    tenant:acme
    ip:10.0.0.1







    Each key gets its own bucket.


    This prevents one bad actor

    from starving everyone else.


    Internally this means:
    • dynamic bucket allocation
    • TTL eviction
    • bounded memory
    • optional LRU trimming


    Without this,

    rate limiting becomes a memory leak.
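
Here is one way per-key buckets with TTL eviction and LRU trimming could fit together. It is a sketch under assumptions, not LimitPal's internals: `KeyedTokenBuckets` and its parameters are hypothetical, and it uses a plain token bucket per key stored in an `OrderedDict`.

```python
import time
from collections import OrderedDict


class KeyedTokenBuckets:
    def __init__(self, capacity: float, refill_per_sec: float,
                 ttl: float = 60.0, max_keys: int = 10_000,
                 now=time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.ttl = ttl
        self.max_keys = max_keys
        self._now = now
        # key -> [tokens, last_seen]; ordered from least to most recently used.
        self._buckets = OrderedDict()

    def __len__(self) -> int:
        return len(self._buckets)

    def try_acquire(self, key: str) -> bool:
        now = self._now()
        bucket = self._buckets.get(key)
        if bucket is None:
            # Dynamic allocation: a new key starts with a full bucket.
            bucket = [self.capacity, now]
            self._buckets[key] = bucket
        else:
            # Refill in proportion to the time since this key was last seen.
            elapsed = now - bucket[1]
            bucket[0] = min(self.capacity, bucket[0] + elapsed * self.refill_per_sec)
            bucket[1] = now
        self._buckets.move_to_end(key)
        self._evict(now)
        if bucket[0] < 1.0:
            return False
        bucket[0] -= 1.0
        return True

    def _evict(self, now: float) -> None:
        # Drop from the least recently used end: expired keys (TTL)
        # and any overflow beyond max_keys (LRU trimming).
        while self._buckets:
            oldest = next(iter(self._buckets))
            expired = now - self._buckets[oldest][1] >= self.ttl
            over_limit = len(self._buckets) > self.max_keys
            if not (expired or over_limit):
                break
            del self._buckets[oldest]
```

An evicted key comes back with a full bucket, so in this sketch the TTL should be at least as long as the time a bucket needs to refill completely; otherwise eviction would hand out extra capacity.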



    Sync + async parity

    Most Python libraries choose:
    • sync OR async


    LimitPal enforces parity.


    Same API.

    Different executor.






    executor.run(...)        # sync executor
    await executor.run(...)  # async executor







    No hidden behavior differences.


    This matters when codebases mix:
    • background workers
    • HTTP servers
    • CLI tools


    One mental model everywhere.
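
One way to get this kind of parity is to keep the decision logic in a plain object and let two thin executors differ only in how they call and wait. The sketch below assumes that design; `RetryPolicy`, `Executor`, and `AsyncExecutor` are hypothetical names, not LimitPal's classes.

```python
import asyncio
import time


class RetryPolicy:
    """Pure decision logic shared by both executors: no I/O, no sleeping."""

    def __init__(self, max_attempts: int = 3, delay: float = 0.0) -> None:
        self.max_attempts = max_attempts
        self.delay = delay

    def should_retry(self, attempt: int) -> bool:
        # `attempt` is zero-based; stop once the attempt budget is spent.
        return attempt + 1 < self.max_attempts


class Executor:
    def __init__(self, policy: RetryPolicy) -> None:
        self.policy = policy

    def run(self, fn, *args, **kwargs):
        attempt = 0
        while True:
            try:
                return fn(*args, **kwargs)
            except Exception:
                if not self.policy.should_retry(attempt):
                    raise
                time.sleep(self.policy.delay)
                attempt += 1


class AsyncExecutor:
    def __init__(self, policy: RetryPolicy) -> None:
        self.policy = policy

    async def run(self, fn, *args, **kwargs):
        attempt = 0
        while True:
            try:
                return await fn(*args, **kwargs)
            except Exception:
                if not self.policy.should_retry(attempt):
                    raise
                await asyncio.sleep(self.policy.delay)
                attempt += 1
```

Because both loops ask the same `RetryPolicy` the same question, a behavior change lands in one place and applies to sync and async callers alike.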





    The real goal

    LimitPal isn’t about rate limiting.


    Or retry.


    Or circuit breakers.


    It’s about:


    making failure behavior explicit and composable


    Resilience stops being ad-hoc glue

    and becomes architecture.


    That’s the difference between:


    “I added retry”


    and


    “I designed a failure strategy.”





    What’s next

    Planned work:
    • observability hooks
    • adaptive rate limiting
    • Redis backend
    • bulkhead pattern
    • framework integrations


    Because resilience doesn’t end at execution.

    It extends into operations.





    Closing thought

    Distributed systems fail.


    That’s not optional.


    What’s optional is whether failure behavior is:
    • accidental
    • or engineered


    LimitPal is an attempt to engineer it.


    Docs:

    A fast, modular rate limiting library for Python with sync and async support.



    Repo:




    If you like deep infrastructure tools — feedback welcome.



